How dangerous was 2016?

Jason Crease shows that 2016 was, indeed, a year of surprisingly many celebrity deaths. The hard part is defining “celebrity”. There had been a previous BBC analysis based on the number of deaths of people with prewritten obituaries, but that is naturally skewed towards what one particular news organization finds useful. Crease’s analysis uses Wikipedia data – both the length of the article and the number of revisions. It turns out that the number of revisions of the Wikipedia article is a more useful metric than the length of the article – a long article can be long in part because it includes lists of relatively uncontroversial material.

Other analyses include:

  • Snopes, based on lists of notable deaths put out by various media organizations – but of course there’s probably some bias towards keeping the lists roughly the same length as in previous years.
  • Researchers at the MIT Media Lab, C. Candia-Castro-Vallejos, Cristian Jara-Figueroa, and César A. Hidalgo, who concluded that fewer famous people died in 2016 than expected (although not many fewer). Their notion of fame attempts to be more cross-cultural and looks at the number of languages someone has a Wikipedia article in.
    (Via Metafilter.)

It may just be that the Anglo-American axis had a bad year (and of course Brexit and the ascendancy of Donald Trump can’t have helped the mood in (the media in) either of those countries…).

But 2017 has a total solar eclipse in the US, so we’ll be okay.

I made it out of clay

Robert Nemiroff and Eva Nemiroff ask: are dreidels fair? Spoiler alert: Betteridge’s law of headlines applies here.

The game traditionally played with the dreidel is unfair, as Ben Blatt showed by simulation and Robert Feinerman showed analytically, but this is assuming that all four sides of the top are equally likely to come up when it is spun. The Nemiroffs took this one step further and checked whether the four sides of the dreidel are equally likely to come up.  They took three dreidels and spun them (800, 1000, and 750 times respectively) and showed that these dreidels were unfair even in this more basic sense.

Interestingly, the patterns seem to tell a story about how the dreidels the Nemiroffs used were flawed. I reproduce their Table 1 here (and yes, they had a dreidel with Christmas imagery on it…)

Dreidel       | ג (gimel) / Santa | נ (nun) / candy cane | ש or פ (shin or pei) / tree | ה (he) / snowman | Total spins
Old wooden    | 109 | 302 | 134 | 255 | 800
Cheap plastic | 311 | 243 | 196 | 250 | 1000
Santa         | 52  | 275 | 126 | 297 | 750

The letters נ (nun) and ה (he) appear opposite each other, as do ג (gimel) and whichever of ש or פ (shin or pei) is used. So what we see here is that:

  • on the “old wooden” dreidel and the “santa” dreidel, two sides opposite each other are preferred – perhaps the dreidel is slightly wider in one direction than the other
  • on the “cheap plastic” dreidel, one side is preferred and the side opposite it is dis-preferred – perhaps the dreidel is slightly heavier on one side or the handle is slightly off-center.
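
One way to put a number on “unfair” is a chi-square goodness-of-fit test against four equally likely sides. This is only a sketch of a standard test applied to the counts in the table; it is not necessarily the method the Nemiroffs used, and the names `SPINS`, `STATS` and `chi_square_uniform` are made up for illustration.

```python
# Observed counts from the table, in the order: gimel, nun, shin/pei, he.
SPINS = {
    "Old wooden": [109, 302, 134, 255],
    "Cheap plastic": [311, 243, 196, 250],
    "Santa": [52, 275, 126, 297],
}

# Standard chi-square critical value for 3 degrees of freedom at the 5% level.
CRITICAL_DF3_05 = 7.815


def chi_square_uniform(counts):
    """Pearson chi-square statistic against equally likely outcomes."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


STATS = {name: chi_square_uniform(counts) for name, counts in SPINS.items()}

for name, stat in STATS.items():
    verdict = "looks unfair" if stat > CRITICAL_DF3_05 else "consistent with fair"
    print(f"{name}: chi-square = {stat:.1f} ({verdict})")
```

All three statistics land far above the critical value, so even the best-behaved of the three (the cheap plastic one) is hard to reconcile with a fair top.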

Presumably dreidels are allowed to be so unfair because nobody is playing dreidel for high stakes, so there’s no real incentive to construct the things properly.

After this year, I can always divide my life into triangles

Today is my 33rd birthday. In honor of that, here are some interesting properties of 33.

One from Wikipedia’s list, which I like because I have a soft spot for integer partition problems, is that it’s the largest positive integer that cannot be expressed as a sum of different triangular numbers. The others are 2, 5, 8, 12, and 23: see OEIS A053614. There’s an almost-proof of this fact in this compilation of problems from mathematical olympiad selection tests; that compilation cites this review paper of Erdős and Graham on results in combinatorial number theory, but I can’t find the result there! If I make it to 128, that’s the largest number that is not a sum of distinct squares.

An idea of the proof is as follows: check by enumeration that 34 through 99 can each be written as a sum of distinct triangular numbers without using 66 or anything larger: 34 = 28 + 6, 35 = 28 + 6 + 1, 36 = 36, 37 = 36 + 1, 38 = 28 + 10, …, 66 = 55 + 10 + 1, …, 99 = 55 + 28 + 15 + 1. Then add 66 to each of these to get a way of expressing 100, 101, …, 165 as a sum of distinct triangular numbers – for example 104 = 66 + 38 = 66 + 28 + 10. Now everything from 34 through 165 is covered using triangular numbers up to 66, so adding the next triangular number, 78, to each of those decompositions covers 112 through 243. And so on: each new triangular number is smaller than the run of consecutive values already covered, so no gaps open up.
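
The enumeration behind this claim is easy to check by brute force. Here is a minimal sketch using a subset-sum table; the function name `non_sums_of_distinct` and the cutoff of 1000 are my own choices, and of course a finite search is evidence, not a proof.

```python
def non_sums_of_distinct(terms, limit):
    """Positive integers up to limit that are NOT a sum of distinct terms."""
    reachable = [False] * (limit + 1)
    reachable[0] = True  # the empty sum
    for t in terms:
        # Go downwards so that each term is used at most once.
        for n in range(limit, t - 1, -1):
            if reachable[n - t]:
                reachable[n] = True
    return [n for n in range(1, limit + 1) if not reachable[n]]


LIMIT = 1000
triangular = [k * (k + 1) // 2 for k in range(1, LIMIT) if k * (k + 1) // 2 <= LIMIT]
squares = [k * k for k in range(1, LIMIT) if k * k <= LIMIT]

print(non_sums_of_distinct(triangular, LIMIT))    # exceptions for triangular numbers
print(max(non_sums_of_distinct(squares, LIMIT)))  # largest exception for squares
```

Up to the cutoff, the triangular exceptions are exactly the six numbers listed above, and the largest exception for squares is 128.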

Why is this worth singling out from the list? Many of the others include some arbitrary constant, such as:

  • “the sum of the first four positive factorials”
  • “the smallest odd repdigit that is not a prime number” (a “repdigit” is a number that consists of the same digit repeated, so the constant 10 is hiding here; in fact you could argue this is basically a strange way of stating the identity 33 = 3(10+1))

It’s also pretty cool that 33 is a Blum integer – that is, a product of two distinct primes, each of which is congruent to 3 mod 4. (But it’s not the first Blum integer – that’s 21.)
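
For the curious, the definition translates directly into a few lines of code. This is just a sketch with trial division, and the helper names `is_prime` and `is_blum` are mine.

```python
import math


def is_prime(n):
    """Trial-division primality test; fine for small n."""
    return n >= 2 and all(n % d for d in range(2, math.isqrt(n) + 1))


def is_blum(n):
    """True if n = p * q with p, q distinct primes, both congruent to 3 mod 4."""
    for p in range(2, math.isqrt(n) + 1):
        if n % p == 0:
            # p is the smallest divisor above 1, hence prime.
            q = n // p
            return p != q and is_prime(q) and p % 4 == 3 and q % 4 == 3
    return False


print([n for n in range(1, 100) if is_blum(n)])
```

The list it prints starts with 21 and 33, matching the remark above.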

Another property of 33, which is less negative, is that it’s the first member of the first cluster of three semiprimes (33 = 3 x 11, 34 = 2 x 17, 35 = 5 x 7). That is, it’s the first member of this sequence. In OEIS terms, I’d say that being the first member of a sequence, or the last member of a sequence, is more interesting than being just out in the middle of the sequence somewhere.

The semiprime thing appears to have an arbitrary constant of 3. But there are no clusters of four or more consecutive semiprimes – out of four consecutive integers, one is divisible by 4, and no multiple of 4 other than 4 itself is a semiprime – so 33 is the first member of the first cluster of semiprimes of maximal length.
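
Both semiprime claims are easy to sanity-check numerically. The sketch below counts prime factors by trial division and looks for runs of consecutive semiprimes; the function names and the search bound of 10,000 are arbitrary choices of mine.

```python
def prime_factor_count(n):
    """Number of prime factors of n, counted with multiplicity."""
    count = 0
    d = 2
    while d * d <= n:
        while n % d == 0:
            n //= d
            count += 1
        d += 1
    if n > 1:
        count += 1  # whatever is left over is a prime factor
    return count


def is_semiprime(n):
    return prime_factor_count(n) == 2


def first_run(length, limit):
    """Start of the first run of `length` consecutive semiprimes below limit, or None."""
    run = 0
    for n in range(2, limit):
        run = run + 1 if is_semiprime(n) else 0
        if run == length:
            return n - length + 1
    return None


print(first_run(3, 10_000))  # first cluster of three
print(first_run(4, 10_000))  # no cluster of four in this range
```

The search for a run of four comes back empty, as the divisibility-by-4 argument says it must.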

Want to know what’s interesting about some number? You could trawl the OEIS or Wikipedia, or you could go to Erich Friedman’s list, which is a bit more selective, only listing one property of each number. In fact both of my interesting properties of 33 appear here – the semiprime one is, for Friedman, a property of 34, “the smallest number with the property that it and its neighbors have the same number of divisors”.

Time zones and election turnout

Another bit of election analysis: When You Don’t Snooze, You Lose: A Natural Experiment on the Effect of Sleep Deprivation on Voter Turnout and Election Outcomes, working paper by John B. Holbein and Jerome P. Schafer.

People just to the east of a time zone boundary sleep 20 minutes less than those on the west side of the time zone boundary. (This is based on the American Time Use Survey.) This depresses voter turnout, which, in a US setting, moves election results to the right. (Anecdote is not data, but this year I voted early one morning in the week before Election Day – we have early voting in Georgia – because I happened to be awake anyway. So at least in my house waking up early drives turnout.) Rain also drives down voter turnout. Perhaps if you really wanted to you could blame the results in Wisconsin on rain this Election Day… but let’s not go down that rabbit hole.

For an illustration of a similar phenomenon, take a look at the Jawbone circadian rhythm map (by Tyler Nolan and Brian Wilt), which shows that people (well, Jawbone fitness tracker owners) on the eastern side of a time zone boundary go to bed later than those on the western side of the same boundary. Interestingly, they don’t see the effect in total amount of time sleeping, which suggests that in their data set people on the eastern side of a time zone boundary also wake up later.

How to flip an election

Another reason Clinton lost Michigan: Trump was listed first on the ballot, by Josh Pasek, University of Michigan. (Disclaimer: I went to middle and high school with Pasek.) From the blog post: “The best estimate of the effect of being listed first on the ballot in a presidential election is an improvement of the first-listed individual’s vote share of 0.31%.” Trump was listed first on the Michigan ballot, because the governor of Michigan is Republican. This study is based on elections in California, which randomizes the order of the candidates on the ballot by precinct. Here’s a preprint of the paper (Pasek, J., Schneider, D., Krosnick, J. A., Tahk, A., Ophir, E., & Milligan, C. (2014). Prevalence and moderators of the candidate name-order effect: Evidence from statewide general elections in California. Public Opinion Quarterly, 78(2), 416-439.).

Clinton also would have won if the map of the United States looked slightly different. If you want to play around with this yourself, you can redraw the states using the tool by Kevin Hayes Wilson. Move Camden County, New Jersey into Pennsylvania and Lucas County, Ohio (i.e. roll back the Toledo War, which was a thing) into Michigan, and Clinton wins. Each of these counties is adjacent to the state it’s being moved into. Here’s the resulting map.

I’m pretty sure that two is the minimal number of counties that have to be moved to get a Clinton win, under the constraint that the counties in each state have to remain geographically contiguous. Clinton starts out needing 37 more EV, and the only way to get that by flipping just one state is to flip Texas; but no state adjacent to Texas went blue. (There is a way to make Clinton win that involves moving one county into another state – namely, move Los Angeles County, California into Texas – but that doesn’t seem to be in the spirit.)

The natural question, then, if we want to know how much “unfairness” is due to the electoral college, is something like this: given the actual voting results, and some “random” partitioning of the US into states, what is the probability of a Trump (or Clinton) win? But what does a “random” partitioning of the US into states even mean?  It seems difficult to define this, given that we don’t have a huge number of alternate histories to run, but I’d imagine we’d want to preserve facts like:

  • some states have many more people than others, but no state is much smaller in population than the average congressional district;
  • more populous states tend to be more urban (this is relevant since the electoral college helps low-population states, and one party is more represented in urban areas);
  • states are geographically relatively compact (unlike, say, Congressional districts in some states)

But in the end this is an academic question, because we don’t get to redraw the states.  (Can you imagine the gerrymandering?)

Dressing goes on salad

Someone needs to make a better stuffing vs. dressing map than this one from Butterball. The problem is that they have a small sample: the fine print reads “This survey was conducted online with a random sample 1,000 men and women in 9 regions – all members of the CyberPulseTM Advisory Panel. Research was conducted in May 2007. The overall sampling error for the survey is +/-3% at the 95% level of confidence.” So the average state has a sample of 20, which would lead to a 21% or so margin of error. This error is enough that the map just looks wrong – Georgia and Mississippi call it stuffing, but Alabama and Tennessee call it dressing?  The Butterball map does seem to capture the regional divide, though, where the South calls it “dressing” and the North calls it “stuffing”.  We’re still fighting the linguistic Civil War in my house.  Obviously this is meant to be entertainment, but get a bigger sample, will you?
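
The back-of-the-envelope margin of error can be spelled out. This sketch assumes a simple random sample, the worst-case proportion p = 0.5, and an even split of the 1,000 respondents across 50 states (the survey itself only mentions 9 regions, so the per-state sample size is a guess); the function name `margin_of_error` is mine.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% confidence level


def margin_of_error(n, p=0.5, z=Z_95):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


national = margin_of_error(1000)        # whole survey
per_state = margin_of_error(1000 / 50)  # assumed 20 respondents per state
scaled = 0.03 * math.sqrt(50)           # scaling the quoted 3% by sqrt(50)

print(f"national: {national:.1%}, per state: {per_state:.1%}, scaled: {scaled:.1%}")
```

The worst-case formula gives about 22% per state, and scaling the quoted 3% by the square root of 50 gives about 21%; either way, far too wide to color in a state-by-state map.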

It looks like Epicurious has some internal data based on search results that led to their site, but they’re not sharing.

My Google Image Search results for “stuffing vs. dressing” find a bunch of pictures of the ambiguously named bready dish, and also this map of the largest religious denomination in US counties and this article on Josh Katz’s maps of Bert Vaux’s dialect survey. “Stuffing” vs “dressing” is not one of the questions in that survey, sadly.

And yes, I know about the compromise where it’s “stuffing” when it’s cooked in the bird and “dressing” when it’s cooked separately. But in my family of origin we generally have too much to fit in the bird, so some gets cooked in the bird and some doesn’t… does that mean we have “dressing” and “stuffing” on the table at the same time?

Use R, vote D?

David Robinson, data scientist at StackOverflow, tweeted:

Of course this is because of a confounder. Namely, R comes out of the statistics community, which is concentrated in places with universities, which also tend to be pro-Democratic in the current political environment. Python, he finds, is also anti-correlated with Trump voting, while C# and PHP are correlated with Trump voting:

Interpret this as you will.  (Seriously, I don’t know enough about who uses C# and PHP to comment anywhere near intelligently.)

The data on language usage by county is not public, but the data on voting is: David Taylor has assembled vote counts by county, and David Robinson has some code for manipulating them and making some plots. Fun fact: the county(-equivalent) with the lowest percentage of Trump voters is the one Trump doesn’t want to move to.