The journal Basic and Applied Social Psychology is banning the NHSTP. (That’s the “null hypothesis significance testing procedure” you might remember from an intro stats course.) This includes banning confidence intervals, thanks to the duality between confidence intervals and hypothesis tests. The journal’s editors write that:
We hope and anticipate that banning the NHSTP will have the effect of increasing the quality of submitted manuscripts by liberating authors from the stultified structure of NHSTP thinking thereby eliminating an important obstacle to creative thinking.
I’ve seen the fixation on statistical significance be a big block in presenting results in business settings. I can’t count the times where I’ve had to explain, especially with “big data” sets, that just because something is statistically significant doesn’t mean it’s practically significant. Presumably the same is true in the social science setting; in both cases you have people who have some statistical education and are generally used to looking at numbers, but who are not statisticians or otherwise specialists in a quantitative field. How much of the cult of statistical significance comes from the choice of words?
However this may be a bit of an overreaction. To say “thou shalt not do X” could be just as restrictive as “thou shalt always do X”.
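The gap between statistical and practical significance on large data sets can be made concrete with a small sketch. The effect size (one hundredth of a standard deviation), the known common standard deviation, and the function name `two_sample_z_test` are all assumptions chosen for illustration, not anything from the journal's editorial.

```python
import math

def two_sample_z_test(mean_diff, sd, n_per_group):
    """Return (z, two-sided p-value) for a difference in means between two
    equal-sized groups, assuming a known common standard deviation."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical effect: 0.01 standard deviations, which is tiny in practical terms.
for n in (1_000, 100_000, 10_000_000):
    z, p = two_sample_z_test(mean_diff=0.01, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>10,}: z = {z:6.2f}, p = {p:.3g}")
```

The effect is the same in every row; only the sample size changes, and with enough data the p-value becomes as small as you like.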
Nate Silver has written on what’s so great about sports data. In short: it’s rich data (not just big data), we know the rules, and feedback comes quickly.
This is from ESPN The Magazine’s “Analytics Issue”, which comes out each year in connection with the Sloan Sports Analytics Conference, held this year in Boston on Friday, February 27 and Saturday, February 28. I went back in 2013, when I was working for a sports ticketing company; a lot of interesting talks happen there. Most talks from past conferences have been posted online, so it’s worth poking around if you have an interest.
As I’m writing this, tomorrow is Chinese New Year (in the US) and today is Ash Wednesday. (I suspect it’ll be a day later when you read this.) This raises a question: does Chinese New Year often fall during Lent (that is, on or after Ash Wednesday)? The coincidence creates conflicts for many Asian Catholics: see e. g. here, here, here, here.
Chinese New Year falls on the second new moon after the winter solstice.
Ash Wednesday is 46 days before Easter. (Yes, 46. I know, you thought Lent was 40 days. Sundays don’t count.) Easter is the Sunday after the first full moon on or after the spring equinox (the “Paschal full moon”).
How far apart are these? Well, there are 90 days between the winter solstice (December 22) and the spring equinox (March 21). This is between three and three-and-a-half lunar months (of 29.5 days each), so from Chinese New Year to the Paschal full moon (i. e. the full moon on or after the spring equinox) is either one-and-a-half or two-and-a-half lunar months. In the cases when it’s one-and-a-half, Ash Wednesday will fall around Chinese New Year; when it’s two-and-a-half, Ash Wednesday will be a month or so after Chinese New Year.
The short interval happens when Chinese New Year is relatively late in the window of dates it can occur, which is January 21 to February 20. In particular, if Chinese New Year is less than about 44 days (that is, one-and-a-half lunar months) before the spring equinox (March 21), then it’s 1.5 lunar months from the Paschal full moon, and we get a situation like this year’s. That is, Chinese New Year is roughly around the beginning of Lent if it’s around February 5 or later – about half the time. Not so unusual after all.
As for the day of the week – Ash Wednesday works out to be between 39 and 45 days before the Paschal full moon. Chinese New Year is about 44 days before it in about half of years. So Chinese New Year can only fall within the first couple days of Lent. (If you want to do the calculations: can it fall later than Thursday? I suspect this might be possible, because the two calendars work on different rules: the Chinese calendar is based on astronomical observation, whereas the Christian ecclesiastical calendar is based on computations that can be done with relatively simple arithmetic.)
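For anyone who wants to do the Christian half of the calculation, here is a minimal sketch. It uses the widely published "anonymous Gregorian" (Meeus/Jones/Butcher) arithmetic for Easter, which is one standard way to carry out the ecclesiastical computation; the function names are mine. The Chinese calendar needs astronomical data, so the only Chinese New Year date here is this year's, taken as the day after Ash Wednesday as described above.

```python
from datetime import date, timedelta

def gregorian_easter(year):
    """Easter Sunday in the Gregorian calendar (anonymous Gregorian algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

def ash_wednesday(year):
    """Ash Wednesday is 46 days before Easter Sunday."""
    return gregorian_easter(year) - timedelta(days=46)

# Chinese New Year 2015, as implied by the post (the day after Ash Wednesday).
chinese_new_year_2015 = date(2015, 2, 19)
offset = chinese_new_year_2015 - ash_wednesday(2015)
print(gregorian_easter(2015), ash_wednesday(2015), offset.days)
```

For 2015 this gives Easter on April 5 and Ash Wednesday on February 18, so Chinese New Year lands one day into Lent. Answering the "later than Thursday" question would need a table of Chinese New Year dates from an astronomical source.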
I just learned that Snow Day Calculator exists, and will tell you the probability of having a snow day from school tomorrow.
Here’s an interview with David Sukhin, its creator, currently a junior at MIT. It appears AccuWeather has a similar predictor. See also the reddit snow closing map.
I’d be interested to know how accurate these forecasts are. The big difficulty here seems to be, as with so many prediction problems, gathering the data set. There are good records of actual snow amounts – but school closures (the dependent variable) and historical weather forecasts (the independent variable, if we want to avoid leakage from the future) are going to be much harder to find. (An easy way to deal with this, once you have a critical mass of users – let them submit whether their school closed today or not?) There has been analysis of the effect of snow days on educational outcomes, but the only thing I could dig up on predicting the probability of snow days is this paper on defining a severity index for snow storms from the National Weather Digest in 1985, which lucked into having a high school with good records. That won’t scale.
Aaron Clauset, Samuel Arbesman, and Daniel B. Larremore have published a paper in Science Advances: Systematic inequality and hierarchy in faculty hiring networks. The long and the short of it is that you’ve got to go somewhere really good for your PhD if you want a faculty job. They develop a “prestige network” by assuming that schools rarely “hire down”: that is, faculty in subject X at a given university generally got a PhD in subject X at some better university. So to put schools in order, you find an ordering of the schools that minimizes the number of hires that go the other way, where someone works at a school ranked above the one that granted their doctorate. (The number of faculty at institutions more prestigious than their doctorate ends up being about 9 to 14 percent.)
I’ve been thinking this would be a good idea for a while, but building the data set isn’t exactly easy. The supplement indicates that “all information was collected manually from public data sources on the World Wide Web, typically from the faculty member’s curriculum vitae or biography on their homepage”, and that a total of about 20k faculty (in three subjects: CS, business, and history) were in the study. The rankings are Figure S10 of the supplement.
Schools, then, work differently than people, at least according to the quote of André Weil as reported by Paul Halmos, from I Want to be a Mathematician: An Automathography: “André Weil suggested that there is a logarithmic law at work: first-rate people attract other first-rate people, but second-rate people tend to hire third-raters, and third-rate people hire fifth-raters.”
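The ordering step in the faculty hiring paper (find the ranking with the fewest hires that move up in prestige) can be sketched by brute force on a toy example. The school names and hiring pairs below are entirely hypothetical, and trying every permutation only works for a handful of schools; this is not the paper's actual procedure, just an illustration of the objective.

```python
from itertools import permutations

# Hypothetical toy data: (school that granted the PhD, school that hired).
hires = [
    ("A", "B"), ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("C", "B"), ("B", "A"), ("C", "D"),
]

def count_violations(order, hires):
    """Count hires where the hiring school ranks above the PhD school."""
    rank = {school: position for position, school in enumerate(order)}  # 0 = best
    return sum(1 for phd, job in hires if rank[job] < rank[phd])

def minimum_violation_ranking(hires):
    """Try every ordering and keep the first one with the fewest violations."""
    schools = sorted({school for pair in hires for school in pair})
    return min(permutations(schools), key=lambda order: count_violations(order, hires))

best = minimum_violation_ranking(hires)
print(best, count_violations(best, hires))
```

Several orderings can tie for the minimum (here B and C are interchangeable), which is one reason a single "best" list should be read with some caution.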
I can’t believe I’m posting a buzzfeed link here, but here you go: 16 Things All Data Scientists Know To Be True.
Nurses are among those most frequently injured on the job, says Daniel Zwerdling of NPR (in a long piece that’s worth reading). One of the most common sources of such injuries is lifting patients, which gets worse as we Americans get heavier.
There’s a chart that is a bit perplexing from a mathematical point of view, though. This chart claims, for example, that the weight of your head is seven percent of your body weight – regardless of that weight. (The trunk is 43 percent, each arm is 5, and each leg is 20.) These are based on “body segment parameters” from, as far as I can tell, a 1996 study by Paolo de Leva based on gamma-ray scanning by earlier researchers. The major use of this sort of work seems to be in studying how the body moves.
But I’d think, for example, that the weight of the head grows more slowly than overall weight – this comes from extensive looking at the heads of people of different weights – and other body parts grow faster to compensate. I don’t have a pile of cadavers or machines for scanning live subjects – any ideas?
From the Census Bureau via Slate, on the income gaps between opposite-sex married couples:
- in 3.9 percent of couples, the husband earns 5,000 to 9,999 more dollars (per year) than the wife;
- in 25.4 percent of couples, the husband earns within 4,999 dollars of the wife;
- in 2.8 percent of couples, the wife earns 5,000 to 9,999 more dollars than the husband.
(The rest of the couples have more than a 10,000-dollar differential.)
Something seems fishy here. Call the wife’s earnings $W$ and the husband’s earnings $H$; we’re interested here in the distribution of the random variable $H - W$. (Of course it’s difficult to write out the distribution of $H - W$; we know $H$ and $W$ are correlated, by assortative mating.) The three bins above correspond to $H - W$ being in the intervals $[5000, 9999]$, $[-4999, 4999]$, and $[-9999, -5000]$. The second interval is twice as wide as the others – so we’d expect twice as many couples to be in that middle bin as the ones on either side of it.
But instead we have six to nine times as many. Any explanations? All I can think of to explain this phenomenon – if it’s real – is that there are a surprisingly large number of cases where the husband and wife do the same job (not just working at the same place, but actually doing the same thing, for the same pay)… but how many couples like that can there be? It seems more likely to be an artifact of how the survey works.
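To spell out the arithmetic behind "six to nine times": the ratios below use the three percentages quoted above, and the last term is what the middle bin would hold if the density of the earnings difference were roughly flat across the range from minus 10,000 to plus 10,000 dollars (my simplifying assumption, not something the Census figures state).

```latex
\[
\frac{25.4}{3.9} \approx 6.5, \qquad
\frac{25.4}{2.8} \approx 9.1, \qquad
2 \times \frac{3.9 + 2.8}{2} = 6.7 \ll 25.4 .
\]
```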
From futility closet, pointing to this entry in the Encyclopedia of Integer Sequences: numbers $n$ such that $n$ divides the number of digits of $n!$ are: 1, 22, 23, 24, 266, 267, 268, 2712, 2713, 27175, 27176, 271819, 271820, 271821, 2718272, 2718273, 27182807, 27182808, 271828170, 271828171, 271828172, and so on. (For example, $266!$ has $532 = 2 \times 266$ digits.) This supposedly comes from a column of Martin Gardner, “Factorial Oddities”, which I don’t have.
This seems a bit mysterious at first: what’s the decimal expansion of $e$ doing there? But there’s a simple explanation. Recall Stirling’s approximation: $n! \approx (n/e)^n$. Taking log base 10, we get $\log_{10} n! \approx n \log_{10}(n/e)$. But for $n!$ to have $kn$ digits, we need $\log_{10} n! \approx kn$. Thus $n!$ will have around $kn$ digits when $\log_{10}(n/e) \approx k$. Solving for $n$ gives $n \approx e \cdot 10^k$.
This basically all follows from the approximation $n! \approx (n/e)^n$.
But the numbers in that sequence are actually a bit below a power of 10 times $e$; recall $e = 2.718281828\ldots$¹, so if what I’d just done worked exactly we’d have 2718281 in the sequence. But we have 2718272 and 2718273, nine and eight less than that. This is because we could have used the more accurate version of the approximation: $n! \approx \sqrt{2\pi n}\,(n/e)^n$. Thus $(n/e)^n$ is a slight underapproximation, and $n!$ reaches $kn$ digits at a slightly smaller $n$ than the crude estimate suggests.
1. no, $e$ isn’t rational.
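The start of the factorial-digits sequence is easy to check numerically. This is a minimal sketch that gets the digit count from `math.lgamma` (the log of the gamma function) rather than from the factorial itself; it relies on floating point, so it could in principle misjudge a factorial that sits extremely close to a power of 10. The function name and the cutoff of 3000 are my choices.

```python
import math

def digits_of_factorial(n):
    """Number of decimal digits of n!, via log10(n!) = lgamma(n + 1) / ln(10)."""
    return math.floor(math.lgamma(n + 1) / math.log(10)) + 1

# Numbers n up to 3000 such that n divides the number of digits of n!.
matches = [n for n in range(1, 3001) if digits_of_factorial(n) % n == 0]
print(matches)

# Crude prediction from n! ~ (n/e)^n: matches should sit near e * 10^k.
print([round(math.e * 10**k) for k in range(4)])
```

Each cluster sits just below the corresponding value of $e \cdot 10^k$, as the more accurate form of Stirling's approximation predicts.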