Press will buzz off, or, doubled letters at the ends of words

John D. Cook wrote, following the latest episode of Kevin Stroud’s History of English podcast, that if a consonant at the end of a word is doubled, it’s probably S, L, F, or Z. It appears that some elementary school teachers teach a “doubled final consonant rule” that is roughly the converse of this: “If a short vowel word or syllable ends with the /f/, /l/, /s/, or /z/ sound, it usually gets a double f, l, s, or z at the end.”

That sounds right to me. Some other consonants seem doubleable but relatively rarely – G, N, and R came to mind, although the only words I could think of that actually end in those doubled are “egg”, “inn”, and “err” respectively. Those are probably instances of the three-letter rule, whereby “content words” tend to have at least three letters. (Walmart has a store brand of electronics, onn. (sic), as well.)

Cook writes: “These stats simply count words; I suspect the results would be different if the words were weighted by frequency.” Challenge accepted. Peter Norvig has some useful data, including “The 1/3 million most frequent words, all lowercase, with counts.”

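library(tidyverse)

# Norvig's unigram counts: one word and its count per line, tab-separated
counts = read_delim('https://norvig.com/ngrams/count_1w.txt', delim = '\t', col_names = c('word', 'count'))

counts %>% mutate(ult = str_sub(word, -1),               # last letter
                  penult = str_sub(word, -2, -2)) %>%    # second-to-last letter
  filter(!(ult %in% c('a', 'e', 'i', 'o', 'u'))) %>%     # keep consonant endings only
  mutate(double = (ult == penult)) %>%
  group_by(ult) %>% 
  summarize(double = sum(double*count), all = sum(count)) %>%   # weight each word by its count
  mutate(pct_double = double/all * 100) %>%
  arrange(desc(pct_double))                              # sort as in the table below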

The result is as follows; the counts in columns “double” and “all” are numbers of words out of the “trillion-word” data set. So over one out of every four words ending in L ends in double-L; only 0.02% of words ending in Y end in double-Y. Norvig’s word counts come from the 2006 trillion-word data set from Alex Franz and Thorsten Brants at the Google Machine Translation Team, which was built from public web pages.

# A tibble: 21 × 4
   ult       double         all pct_double
   <chr>      <dbl>       <dbl>      <dbl>
 1 l     6881678966 24910774907    27.6   
 2 z       49070005   684008846     7.17  
 3 s     3928825545 84468484631     4.65  
 4 f      734823669 16788706705     4.38  
 5 x       85268118  3108416171     2.74  
 6 c      128294648  6294874635     2.04  
 7 j        7948601   390085263     2.04  
 8 b       49690347  2690064550     1.85  
 9 p       98252923  6905211199     1.42  
10 d      460147985 47335849371     0.972 
11 m       95423709 11371100313     0.839 
12 w       52402066  6908722748     0.758 
13 q        2645593   456516943     0.580 
14 t      295347877 52262740152     0.565 
15 n      238685552 49492910349     0.482 
16 v        3478513  1084201734     0.321 
17 g       51208927 19948325553     0.257 
18 r       58629294 38947533393     0.151 
19 k        8943711  8602400357     0.104 
20 h       13449679 14180781466     0.0948
21 y        7451329 35763181677     0.0208

The count for “L” is so high because of words like “all” and “will”. It turns out in this corpus my intuition about G, N, and R being plausible is spectacularly wrong. We can also get the five most common words ending in each double letter:

counts %>% mutate(ult = str_sub(word, -1),
                  penult = str_sub(word, -2, -2)) %>%
  filter(!(ult %in% c('a', 'e', 'i', 'o', 'u'))) %>%
  mutate(double = (ult == penult)) %>% filter(double) %>%
  group_by(ult) %>% mutate(rk = rank(-count)) %>% filter(rk <= 5) %>%
  summarize(top5 = paste(word, collapse = ', ')) %>%
  print(n = 21)

giving the following table:

# A tibble: 21 × 2
   ult   top5                                   
   <chr> <chr>                                  
 1 b     bb, phpbb, webb, bbb, cobb             
 2 c     cc, acc, gcc, fcc, icc                 
 3 d     add, dd, odd, todd, hdd                
 4 f     off, staff, stuff, diff, jeff          
 5 g     egg, gg, digg, ogg, dogg               
 6 h     hh, ahh, ahhh, ohh, hhh                
 7 j     jj, hajj, jjj, bjj, jjjj               
 8 k     kk, dkk, skk, fkk, kkk                 
 9 l     all, will, well, full, small           
10 m     mm, comm, hmm, hmmm, dimm              
11 n     inn, ann, lynn, nn, penn               
12 p     pp, app, ppp, supp, spp                
13 q     qq, qqqq, qqq, sqq, haqq               
14 r     rr, err, carr, starr, corr             
15 s     business, address, access, class, press
16 t     scott, matt, butt, tt, hewlett         
17 v     vv, vvv, rvv, vvvv, cvv                
18 w     www, ww, aww, libwww, awww             
19 x     xxx, xx, xnxx, xxxx, vioxx             
20 y     yy, yyyy, nyy, yyy, abbyy              
21 z     jazz, buzz, zz, jizz, azz  

and S, L, F, and Z do emerge as the only letters where the resulting words aren’t just junk. The rule seems to loosen for names (Matt, Scott, Hewlett; Ann, Lynn, Penn; Cobb). The data set betrays its internet origins here, with phpbb and libwww being relatively common.

I’ve found the pattern at the end of the code block above,

group_by(ult) %>% mutate(rk = rank(-count)) %>% filter(rk <= 5) %>% summarize(paste(word, collapse = ', '))

useful in practice for giving a quick summary of a table where there are many possible values of a second variable for each value of a first variable, and we just want to show which are most common.
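A possibly tidier variant of the same idea, assuming dplyr 1.0.0 or later is available: slice_max() keeps the top rows per group directly, so the explicit rank column isn’t needed. This is only a sketch; the toy table below is invented for illustration and is not taken from Norvig’s data.

```r
library(dplyr)

# Toy data, made up for illustration only.
toy <- tibble(
  ult   = c("l", "l", "l", "s", "s", "s"),
  word  = c("all", "will", "bell", "class", "press", "moss"),
  count = c(90, 80, 5, 70, 60, 3)
)

top2 <- toy %>%
  group_by(ult) %>%
  slice_max(count, n = 2, with_ties = FALSE) %>%   # two most common words per group
  summarize(top = paste(word, collapse = ", "))

print(top2)
```

One difference worth knowing: rank() gives tied rows averaged ranks, so the rank-and-filter version can keep more or fewer than five rows when counts tie, while with_ties = FALSE always keeps exactly n.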

Four points is not enough

The World Cup just started. As you may know, there are 32 teams, in eight groups of four. Each team in a group plays three games, one against each of the other teams in the group. The top two teams in each group advance to the “knockout round” of 16; teams get three points for a win, one point for a draw, and no points for a loss, so the most points a team can have is 9, the least is 0. (It’s not possible to get 8 points, but every other number is possible.)

So how many points is enough to advance?

With a quick Google I found this article from the Hindustan Times in 2018, saying that traditionally people think four points is enough, but in practice, 17 out of 33 teams with four points were in the top two in their group between 1994 and 2014. (By my count it’s 18 out of 35, which doesn’t materially impact the conclusion.) “Three points for a win” was introduced in the 1994 World Cup, so this is as far back as it’s meaningful to go.

Similarly, in the run-up to the current tournament, Fox Sports Australia writes: “Four points is really that magical mark that they need to aim at. You can miss the top two of your group with four points – 10 teams have done it across the last four World Cups – but the overwhelming majority of teams that reach that figure make it out.”

But if you stop to think about it for a moment, a win, a loss, and a draw (which is the only way to get four points) is a middling result, and you need to be in the top half to advance… this is best illustrated by 1994 group E, where all four teams got four points, and of course only two advanced.

And in Slate a couple days ago, Eric Betts wrote: “One win, one draw and one disheartening loss might be enough to get to the knockout rounds, but not necessarily. Two teams with four points advanced in 2018 and one—Iran—was sent home.” (There’s a slight error here – Argentina and Japan advanced with four points, Iran and Senegal didn’t.)

But if we go through and tabulate all 54 groups in the 1994 through 2018 World Cups (six groups in 1994, eight in each of 1998-2018) we really see that four points is not enough. Here’s a table of the teams by their rank in group and their number of points.

There’s an interesting anomaly here. Teams have missed the top two with six points – in 1994 both group D and group F had the point totals 6-6-6-0, with one team that was beaten by each of the other three, while the other three had a “cycle” of wins. (The third-place team in each of those groups advanced to the knockout round – in 1994 there were only six groups, as opposed to the current eight, so the top four third-place teams also advanced to the knockout round.) But no team has ever failed to advance with 5. (This is possible – a group where team A loses to B, C, and D, and all three games among B, C, D are ties, would have point totals 5-5-5-0.)

So you need five points to advance (unless something weird happens). Four is not enough.
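The point totals mentioned above can be checked by brute force. Here’s a minimal sketch in base R that enumerates all 3^6 = 729 ways the six games in a group can go; the variable names are mine, and tiebreakers are ignored entirely.

```r
# The six games in a four-team group: each row is a pair of teams.
games <- t(combn(4, 2))

# Every combination of results: 0 = draw, 1 = first team wins, 2 = second team wins.
results <- as.matrix(expand.grid(rep(list(0:2), 6)))   # 729 rows, one column per game

# Points for each team under 3-1-0 scoring, one row per combination of results.
points <- matrix(0, nrow(results), 4)
for (g in 1:6) {
  a <- games[g, 1]; b <- games[g, 2]; r <- results[, g]
  points[, a] <- points[, a] + c(1, 3, 0)[r + 1]
  points[, b] <- points[, b] + c(1, 0, 3)[r + 1]
}

# Point totals that a single team can finish with.
possible_totals <- sort(unique(as.vector(points)))

# Final group tables, sorted and written like "6-6-6-0".
tables <- unique(apply(points, 1, function(p) paste(sort(p, decreasing = TRUE), collapse = "-")))

print(possible_totals)
print(c("4-4-4-4", "5-5-5-0", "6-6-6-0") %in% tables)
```

The first line of output lists every reachable total (8 should be missing), and the second checks that the three awkward group tables discussed here can all actually occur.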

When do the polls close for most people?

I tried to Google this and couldn’t – how many people live in places where the polls close tonight at time X, for each possible value of X? You can easily find a map of closing times, for example at 270towin.com:

But how many people do each of those colors represent?

The Green Papers has a nice table of poll closing times. The wrinkle is that in a state with multiple time zones, there are two possibilities:
  • polls close at the same clock time across the state. For example, in Florida, polls close at 7:00 PM everywhere in the state… as we learned in 2000 when it was called for Democrats before polls in the Panhandle (the westernmost part of the state, the only part on Central time, and a heavily Republican area) had closed. This is the method in Alaska, Florida, Idaho, Kansas, Kentucky, Michigan, North Dakota, South Dakota, and Texas.
  • polls close at the same real time across the state. This is what Nebraska does (8 Central / 7 Mountain) and Tennessee (8 Eastern / 7 Central).
We need to make some assumptions about what proportion of each split state is in each time zone. Fortunately, someone named “segacs” made a Sporcle quiz in 2015 which counted the population in each time zone worldwide, and broke down the results in a Google spreadsheet. We can just extrapolate those results forward – Tennessee was 34% Eastern time and 66% Central according to 2015 Census data, so we’ll just carry that forward to 2020. If everybody’s been leaving Knoxville and Chattanooga to move to Nashville and Memphis, we won’t know.

As it turns out, in most places the polls close at 7 or 8 local time, and those represent about equal numbers of people. The exceptions are:

  • Kentucky and Indiana: 6 pm local
  • North Carolina, Ohio, West Virginia, and Arkansas: 7:30 pm local
  • New York and North Dakota: 9 pm local. (Is there anything else New York and North Dakota have in common?)

The overall distribution is in the chart below.

And in Eastern-time terms, the distribution is:

Both of these charts and the underlying data are at this Google spreadsheet.

This should be familiar to people who make a habit of watching the election returns roll in… you get the first substantial votes at 7, a big chunk at 8, and they trickle in over the rest of the night. (In presidential years the 11:00 chunk isn’t as interesting as you’d expect from its volume – the only polls closing at 11 are California, Oregon, Washington, and small portions of North Dakota and Idaho, and if any of those states are competitive the election as a whole is not.)

Too small to show on the chart are the polls that close at 1 AM. Those are the polls that close at 8 PM (Hawaii-)Aleutian time (UTC-10, five hours behind Eastern time), in that part of the Aleutian Islands of Alaska west of 169° 30′ W longitude. In terms of populated places it looks like this is a really long-winded way of saying Adak. Adak has 326 people. The biggest settlement in the Aleutians, Unalaska, is only at 166° 32′ W and is therefore in UTC-9, “Alaska time”. Brian Brettschneider, Alaska-based climatologist, called out Adak in 2016:

[embedded tweet]

and at least a cursory look at a list of Alaska polling places suggests there are two in the Aleutians, “Aleutians No. 1” in Adak and “Aleutians No. 2” in Unalaska. It seems quite reasonable that there is only one polling place, the one in Adak, that closes at 1 AM Eastern. This oddity has been mentioned before, in 2012 and 2016, in both cases by local sources. In 2016 five people voted after 8 PM Alaska / 7 PM Aleutian (midnight Eastern).

Not that anything will be called in Alaska when the polls close… Alaska uses ranked-choice voting, so it’ll take a while to count the votes anyway.

Does Game 3 have some special magic?

Saw a “statistic” during game 3 of the World Series yesterday – teams that have won Game 3 have gone on to win 68 times out of 98 (69%).

First of all, 98 is a strange denominator there… that would be “since 1923”, but the first World Series was played in 1903! But 1922 was the last time there was a tied game in a World Series, so this is presumably 1923 through 2021 (99 years), with the exception of 1994?

You can find this statistic in, for example, this CBS Sports item or this tweet from @MLB.

Okay, turns out I misheard it – it’s the winner of all Game Threes in best-of-seven MLB postseason series when the first two games were split 1-1.

So is this surprising? Not really… the winner of such a Game Three has to win at least two out of the next four to win the series. If games are coin flips, that happens 11/16 of the time (69%). Game Three doesn’t have some special magic… it’s just that a 2-1 lead is substantial.
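For anyone who wants to check the 11/16, here’s a quick sketch in base R. It assumes every game is an independent fair coin flip, and it pretends all four remaining games are played even if the series is already decided, which doesn’t change who wins.

```r
# Chance that the Game Three winner takes at least 2 of the 4 remaining games.
p_at_least_2 <- sum(dbinom(2:4, size = 4, prob = 0.5))

print(p_at_least_2)   # 0.6875, which is 11/16
```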

This post should be going out at the first pitch of Game 4. Go Phillies!

How many states are south of Mexico?

The site barelybad.com asks:

How many U.S. states have any portion of their borders north of the southernmost part of Canadian land?

In short, “how many states are north of Canada?”, although that phrasing is a bit disingenuous. It’s not a trick question: “Canada” is the bit you’re used to seeing on maps. Here’s the answer. Here’s a blog post with a map.

So, then, how many states are south of Mexico? More formally: how many U.S. states have any portion of their borders south of the northernmost part of Mexican land?

This one is a bit easier, I think.

Obviously you have Hawaii.

All of the southernmost tier of states are included – California, Arizona, New Mexico, Texas, Louisiana, Mississippi, Alabama, Georgia, Florida. California is the surprising one, I think, but the border doesn’t actually run due east-west in California. The border was defined by the Treaty of Guadalupe Hidalgo (which ended the Mexican-American War) to be a line from the junction of the Colorado and Gila rivers (near Yuma, Arizona) to a point one Spanish league south of the southernmost point of San Diego Bay; that point where the Colorado and Gila meet is the northernmost point of Mexico. This point was hard to find in reality, according to Joel Levanetz of the San Diego Historical Society. The adjacent Mexican town, Los Algodones, Baja California, apparently does a thriving business in dental tourism. Google gives the coordinates of that northeastern corner as 32.71865 N, 114.71972 W.

That the eastern states have land south of this isn’t so obvious. But it happens that I know that Atlanta and Los Angeles are at about the same latitude (34 degrees north) and there’s a lot more Georgia south of Atlanta than there is California south of Los Angeles.

South Carolina and Arkansas are the only really questionable states from eyeballing a map. Below is a screenshot of the Google map of the US… note that Google uses Web Mercator so horizontal lines on the map are actually lines of latitude.

Map of US with northernmost point of Mexico circled

The southern border of Arkansas (with Louisiana) is the 33rd parallel north, so no part of Arkansas is south of any part of Mexico. See for example the Encyclopedia of Arkansas. When the border was established, it was the border between the District of Orleans (present-day Louisiana) and the District of Louisiana (present-day Arkansas).

As for South Carolina, a line heading due east through the northernmost point of Mexico passes through it. This line was drawn with the highly advanced technology of “opening the screenshot up in Preview and drawing it, holding down the Shift key to make sure the line was horizontal”:

As you can see the area below the line takes in a tiny sliver of California, portions of Arizona, New Mexico, and Texas, nearly all of Louisiana, about half of Mississippi, Alabama, and Georgia, all of Florida, and a small bit of South Carolina. The line passes just south of Charleston:

This is also a quiz on Sporcle that I played once, in 2017. I got 10 out of 11; I don’t know which one I missed.

A hat tip is due to this Reddit post which got me thinking about this.

What are you trying to optimize in Wordle?

Here’s my solution to the Wordle puzzle from yesterday, Sunday, August 7, 2022:

This is a good example of the strategy I generally follow in Wordle:
– start with CRANE, because WordleBot says it’s the best first move. (I used to start with STERN, inspired by the Wheel of Fortune starting letters RSTLNE.)
– usually guess words that are consistent with the information received from previous words. This is what Wordle calls “hard mode” but I don’t actually turn on that setting.

In this case, the E, A, and R in EARTH have different positions from those in CRANE; similarly for RELAY. That in turn forces the position of the R and the E in the fourth word – by the fourth guess the answer must be ??EAR.

So what should my fourth guess be? WordleBot, the New York Times tool for analyzing Wordle play, says that after guessing RELAY, there are three possible solutions, SMEAR, SPEAR, and SWEAR. I believed these were three possible solutions but couldn’t be sure they were the only ones.

Groups after my third guess, thanks to WordleBot

Three guesses left, three possible solutions left, so at this point I’m guaranteed to win.

But say I only had two guesses left, and I want to maximize my chance of winning. Then the optimal strategy is to guess a word that includes at least two of P, M, and W – let’s say WIMPY. WIMPY is a wimpy guess in that it’s guaranteed to be wrong, but one of W, M, and P will turn yellow, and this gives the information to get the next guess right. The strategy of guessing SMEAR, SPEAR, and SWEAR in turn has a two-thirds chance of winning.

On the other hand, say I only had one guess left. Then WIMPY has probability zero of winning; any of SMEAR, SPEAR, or SWEAR has one-third, so I may as well go for it.

If we ignore the arbitrary six-guess limit, and assume we’re playing to minimize the expected total number of guesses (say, because we have to pay for each guess), then it doesn’t matter what we do – either way the expected number of guesses needed is two. But Wordle collects statistics on how many times you’ve won, and doesn’t compute an average number of guesses, so the framing is really towards maximizing win probability.
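The numbers for the two strategies can be tabulated in a few lines. This is a sketch in base R under the assumption that the three remaining candidates are equally likely; the function names are mine.

```r
n <- 3   # remaining candidates: SMEAR, SPEAR, SWEAR, assumed equally likely

# Strategy A: guess the candidates one after another.
# The answer turns up on guess 1, 2, or 3 with probability 1/3 each.
expected_in_turn <- mean(1:n)
win_prob_in_turn <- function(guesses_left) min(guesses_left, n) / n

# Strategy B: spend one guess on a splitter such as WIMPY, then guess the answer.
expected_splitter <- 2
win_prob_splitter <- function(guesses_left) as.numeric(guesses_left >= 2)

print(c(in_turn = expected_in_turn, splitter = expected_splitter))
print(sapply(1:3, win_prob_in_turn))    # win probability with 1, 2, 3 guesses left
print(sapply(1:3, win_prob_splitter))
```

Both strategies average two guesses, but the win probabilities differ depending on how many guesses remain, which is the whole point of asking what you’re optimizing.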

There’s a sportsball analogy here. If there’s plenty of time left in the game, playing to maximize the expected number of points is probably the right move; but if there’s little time left, the strategy that maximizes the probability of winning may be different from the strategy that maximizes the expected score. Examples include intentional walks in baseball, going for two-point conversions instead of extra points in American football, etc.

Also, picking Sarah Palin as your running mate. That was probably a negative-expectation move for John McCain, but he was already behind. A negative-expectation but high-variance strategy might have been the right one.

Why is a euro close to a dollar?

As I write this post 1 US dollar = 0.9914 euros. For a moment on Thursday, July 14, 2022 the US dollar was slightly above the euro, where it hasn’t been in twenty years. The chart below is from xe.com (the URL gives a chart for the previous week).

chart of USD to EUR conversion rate, July 9 to 16, 2022

This has been in the news (NPR, AP, Reuters, NYT). Of course the barrier is only psychological. But is there some reason the euro is basically in the neighborhood of 1 US dollar? This is astoundingly hard to search for, even before EUR/USD parity came into the news.

But it turns out there is. The euro is the successor to the European currency unit (ECU), which was a currency basket used internally by the European Economic Community. The ECU was in turn a successor to the European Unit of Account (EUA), which was defined in 1950 to be 0.888671 grams of gold, and was redefined in terms of a basket of European currencies in the seventies.

Seems like a strange number, 0.888671 grams of gold.

It’s 1/35 troy ounce. The US dollar, under the gold standard, was convertible to gold at $35 per troy ounce.

So basically the predecessor to the euro was defined to be worth one US dollar.
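The arithmetic is quick to confirm. A small sketch in R, using the standard definition of the troy ounce as 31.1034768 grams (that figure is my addition, not from the sources above):

```r
grams_per_troy_ounce <- 31.1034768   # standard definition of the troy ounce

# One dollar's worth of gold at $35 per troy ounce, in grams.
eua_in_grams <- grams_per_troy_ounce / 35

print(round(eua_in_grams, 6))
```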

Rolling the dice

From the March 11, 2022 “Riddler”:
We’re playing a game where you have to pick four whole numbers. Then I will roll four fair dice. If any two of the dice add up to any one of the numbers you picked, then you win! Otherwise, you lose.

For example, suppose you picked the numbers 2, 3, 4 and 12, and the four dice came up 1, 2, 4 and 5. Then you’d win, because two of the dice (1 and 2) add up to at least one of the numbers you picked (3).

To maximize your chances of winning, which four numbers should you pick? And what are your chances of winning?

Some first thoughts:

  • You want numbers that are common as the sums of two dice – middling numbers, numbers near seven.
  • The problem has a reflection symmetry. The dice values x_1, x_2, x_3, x_4 win with the target sums y_1, y_2, y_3, y_4 if and only if the dice values 7-x_1, 7-x_2, 7-x_3, 7-x_4 win with the target sums 14-y_1, 14-y_2, 14-y_3, 14-y_4.

Putting these together, a symmetric set of middling numbers seems likely to be the best target set – something like 5, 6, 8, 9. This is a nasty case analysis, but it’s easy to do by brute force in R.

library(tidyverse)


dice_and_targets = expand.grid(d1 = 1:6, d2 = 1:6, d3 = 1:6, d4 = 1:6,
                   t1 = 2:12, t2 = 2:12, t3 = 2:12, t4 = 2:12) %>% filter(t1 < t2 & t2 < t3 & t3 < t4)  %>% 
  mutate(s12 = d1 + d2, s13 = d1 + d3, s14 = d1 + d4,
          s23 = d2 + d3, s24 = d2 + d4, s34 = d3 + d4)

The data frame `dice_and_targets` has a row for every possible combination of dice results (d1 … d4) and targets (t1 … t4), and the sums of the dice (s12 … s34). It’s a big data frame, with 6^4 × (11 choose 4) = 1296 × 330 = 427680 rows, one for each of the 1296 possible dice rolls and 330 choices of targets.

Let’s take a look at a sample of this data frame, consisting of 10 randomly selected rows:

set.seed(1)
dice_and_targets$idx = sample(nrow(dice_and_targets))
dice_and_targets %>% filter(idx <= 10) %>% select(-idx)

   d1 d2 d3 d4 t1 t2 t3 t4 s12 s13 s14 s23 s24 s34
1   1  2  3  5  2  3  5  8   3   4   6   5   7   8
2   4  1  6  5  2  3  4  9   5  10   9   7   6  11
3   2  3  4  5  3  6  7  9   5   6   7   7   8   9
4   2  1  5  6  3  5  8  9   3   7   8   6   7  11
5   1  6  4  6  2  6  9 10   7   5   7  10  12  10
6   3  4  4  6  3  6  7 11   7   7   9   8  10  10
7   2  2  4  2  4  8 10 11   4   6   4   6   4   6
8   1  2  3  5  2  3  4 12   3   4   6   5   7   8
9   2  6  6  1  2  6  8 12   8   8   3  12   7   7
10  1  6  3  6  6  8 11 12   7   4   7   9  12   9

Consider, for example, the first row. In this case we roll 1, 2, 3, and 5; the targets are 2, 3, 5, and 8; the pairwise sums are 1+2 = 3, 1+3 = 4, 1+5 = 6, 2+3 = 5, 2+5 = 7, and 3+5 = 8; and we win the game, in fact three times over, since the pairwise sums include three of the targets, namely 3, 5, and 8.

Next we can work out which rows win, leveraging some bitwise operations because how often do I get a chance to use these?

dice_and_targets = dice_and_targets %>% mutate(target_bits = bitwOr(bitwOr(2^t1, 2^t2), bitwOr(2^t3, 2^t4)))
dice_and_targets = dice_and_targets %>% 
  mutate(sum_bits = bitwOr(bitwOr(bitwOr(2^s12, 2^s13), bitwOr(2^s14, 2^s23)), bitwOr(2^s24, 2^s34)))
dice_and_targets = dice_and_targets %>% 
  mutate(win = bitwAnd(target_bits, sum_bits) > 0)

In this case target_bits has the bit corresponding to 2^t set if t is one of the targets; sum_bits has the bit corresponding to 2^s set if s is one of the pairwise sums. Then bitwAnd(target_bits, sum_bits) has a nonzero bit if and only if we have a winning combination.

Let’s look at those randomly selected rows, now with the wins figured out:

dice_and_targets %>% filter(idx <= 10) %>% select(-idx)

   d1 d2 d3 d4 t1 t2 t3 t4 s12 s13 s14 s23 s24 s34 target_bits sum_bits  win
1   1  2  3  5  2  3  5  8   3   4   6   5   7   8         300      504 TRUE
2   4  1  6  5  2  3  4  9   5  10   9   7   6  11         540     3808 TRUE
3   2  3  4  5  3  6  7  9   5   6   7   7   8   9         712      992 TRUE
4   2  1  5  6  3  5  8  9   3   7   8   6   7  11         808     2504 TRUE
5   1  6  4  6  2  6  9 10   7   5   7  10  12  10        1604     5280 TRUE
6   3  4  4  6  3  6  7 11   7   7   9   8  10  10        2248     1920 TRUE
7   2  2  4  2  4  8 10 11   4   6   4   6   4   6        3344       80 TRUE
8   1  2  3  5  2  3  4 12   3   4   6   5   7   8        4124      504 TRUE
9   2  6  6  1  2  6  8 12   8   8   3  12   7   7        4420     4488 TRUE
10  1  6  3  6  6  8 11 12   7   4   7   9  12   9        6464     4752 TRUE

In the first row, here, target_bits is 2^8+2^5+2^3+2^2 = 100101100_2 = 300 and sum_bits is 2^8+2^7+2^6+2^5+2^4+2^3 = 111111000_2 = 504. And bitwAnd(target_bits, sum_bits) (not shown) is 2^8 + 2^5 + 2^3 = 100101000_2 = 296. Since it’s greater than zero, that counts as a win.
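The same first-row arithmetic can be reproduced on its own. This is just a sketch in base R with the targets and sums typed in by hand; it also decodes the result back into the targets that were hit, using the fact that bitwAnd() recycles its arguments.

```r
targets <- c(2, 3, 5, 8)
sums    <- c(3, 4, 6, 5, 7, 8)

# One bit per value: bit t is set when t is in the set.
target_bits <- sum(2^unique(targets))   # 300
sum_bits    <- sum(2^unique(sums))      # 504

overlap <- bitwAnd(target_bits, sum_bits)
print(overlap)                           # 296

# Decode the set bits back into the values they stand for.
hits <- which(bitwAnd(overlap, 2^(0:12)) > 0) - 1
print(hits)
```

Summing distinct powers of two gives the same bit pattern as chaining bitwOr(), which is why unique() is applied first.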

You might get the idea that it’s impossible to lose from this sample. We got a little lucky here: mean(dice_and_targets$win) returns 0.874018. If you pick a random roll of four dice and four random targets out of 2, 3, …, 12, one of the pairwise sums will be in the target set 87% of the time.

But we want to know which target set makes a win most likely.

win_counts_by_target = dice_and_targets %>% group_by(t1, t2, t3, t4) %>% summarize(wins = sum(win)) %>% arrange(desc(wins))

head(win_counts_by_target)

# A tibble: 6 x 5
# Groups:   t1, t2, t3 [5]
     t1    t2    t3    t4  wins
  <int> <int> <int> <int> <int>
1     4     6     8    10  1264
2     2     6     8    10  1246
3     4     6     8    12  1246
4     4     6     7     9  1238
5     5     7     8    10  1238
6     4     7     8     9  1236

There we go! And not a loop in sight.

Once I have the 4, 6, 8, 10 target set it’s easy to come up with that number 1264. Consider just the dice that show even numbers – you win if there are at least two of these and they’re not all showing 6. Similarly you win if there are at least two odd dice and they’re not all showing 1. So the losing combinations are

(1, 1, 1, 1), (1, 1, 1, 2), (1, 1, 1, 4), (1, 1, 1, 6), (1, 1, 6, 6), (1, 6, 6, 6), (3, 6, 6, 6), (5, 6, 6, 6), (6, 6, 6, 6)

and their rearrangements, of which there are 32.
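The count of 32 can be double-checked directly. Here’s a sketch in base R (the variable names are mine, and it doesn’t reuse the data frame above) that fixes the target set at 4, 6, 8, 10 and lists the losing rolls up to rearrangement:

```r
targets <- c(4, 6, 8, 10)

# All 6^4 = 1296 rolls of four dice.
rolls <- as.matrix(expand.grid(d1 = 1:6, d2 = 1:6, d3 = 1:6, d4 = 1:6))
pairs <- combn(4, 2)   # the six pairs of dice, one per column

# A roll wins if any pairwise sum is one of the targets.
wins <- apply(rolls, 1, function(roll) {
  any((roll[pairs[1, ]] + roll[pairs[2, ]]) %in% targets)
})

# Losing rolls, each sorted so that rearrangements collapse to one pattern.
losing_patterns <- unique(t(apply(rolls[!wins, , drop = FALSE], 1, sort)))

print(c(wins = sum(wins), losses = sum(!wins)))
print(losing_patterns)
```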

Incidentally, my initial guess (5, 6, 8, 9) wasn’t bad – it wins 1228 times out of 1296, good enough for 14th place out of the 330 possible target sets. And the very worst target set? It’s (2, 3, 11, 12). No surprise there, although even that one wins 776 times out of 1296, nearly 60% of the time.

Expected number of all-Confederate World Series

The World Series starts today. Atlanta vs. Houston. This is wrong for multiple reasons:

  • the Astros are cheaters
  • the Astros are a National League team
  • as a native Philadelphian, I’m obligated to hate the Braves, even though I moved to Atlanta
  • these are warm-weather teams and part of the fun of the ridiculously late postseason is that it’s not really baseball weather, but I just went for a walk and it’s pretty nice out.

Nathaniel Rakich observed that this is the first-ever World Series between teams from the former Confederacy. This surprised me! But there are only five teams in the former Confederacy, out of 30 in MLB (29 of which are in the US). In chronological order of formation, they are

  • the Houston Astros (NL 1962-2012, AL 2013-present)
  • the Atlanta Cobb County Braves (NL 1966-present, moved from Boston)
  • the Texas (Dallas-area) Rangers (AL 1972-present, moved from Washington)
  • the Florida/Miami Marlins (NL 1993-present)
  • the Tampa Bay Devil Rays (AL 1998-present)

In particular Missouri never seceded, which matters quite a bit here because the St. Louis Cardinals have been in the World Series the second-most of any team.

First, a few words about Major League Baseball. There are currently two “leagues” comprising MLB, the National League and the American League. Each has 15 teams, of which one (the “pennant winner”) will make it to the World Series.

Organized baseball got started in the late 19th century, and the 16 teams of its “classic” alignment were all in northern cities, since there were few large southern cities at the time. From 1903 to 1952 the teams were located as follows: Boston x2, Brooklyn, Chicago x2, Cleveland, Cincinnati, Detroit, New York x2, Philadelphia x2, Pittsburgh, St. Louis x2, Washington. In 1953-1972 a bunch of teams moved but since then MLB has mostly grown via expansion.

The former Confederacy is still underrepresented in MLB – it has a population of about 108 million, compared to the US population of 331 million, so it “ought” to have nine or ten teams. Or, if you’re going to argue that an MLB team has to be in a big city, nine of the thirty largest metropolitan areas are in the former Confederacy. (In order, they’re Dallas, Houston, Miami, Atlanta, Tampa, Orlando, Charlotte, San Antonio, and Austin. The first five have teams, and I believe the latter four have been thrown around as expansion candidates.) So although historically the country’s big cities may have been in the north, this is less true now.

Given the historical locations of the teams, how many all-Confederate World Series would we expect? We start counting in 1972, when the American League got its first team in the former Confederacy.

| Years | NL teams in former Confederacy | NL teams total | AL teams in former Confederacy | AL teams total |
|---|---|---|---|---|
| 1972-76 (5) | 2 (Houston, Atlanta) | 12 | 1 (Texas) | 12 |
| 1977-92 (16) | 2 | 12 | 1 | 14 (+Seattle, Toronto) |
| 1993-97 (5) | 3 (+Florida) | 14 (+Florida, Colorado) | 1 | 14 |
| 1998-2012 (15) | 3 | 16 (+Arizona, Milwaukee) | 2 (+Tampa Bay) | 14 (+Tampa Bay, -Milwaukee) |
| 2013-21 (9) | 2 (-Houston) | 15 (-Houston) | 3 (+Houston) | 15 (+Houston) |

Table of team counts in former Confederacy and overall, by year

So for example, in each of 1972-76, the chances of both pennant winners coming from the former Confederacy were 2/12 x 1/12 = 2/144. With the current alignment it’s 2/15 x 3/15 = 6/225.

The expected number of all-Confederate World Series is

5 x 2/12 x 1/12 + 16 x 2/12 x 1/14 + 5 x 3/14 x 1/14 + 15 x 3/16 x 2/14 + 9 x 2/15 x 3/15 = 0.978

which is honestly lower than I expected! But it’s only fairly recently that there have been an appreciable number of MLB teams in this part of the country, and the fact that you need teams from both leagues to get through really keeps this number down.
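Here’s that sum as a small R sketch, mostly so the team counts are easy to tweak. It assumes, as the calculation above does, that every team in a league is equally likely to win the pennant in a given year; the column names are mine.

```r
eras <- data.frame(
  years    = c(5, 16, 5, 15, 9),
  nl_conf  = c(2, 2, 3, 3, 2),  nl_total = c(12, 12, 14, 16, 15),
  al_conf  = c(1, 1, 1, 2, 3),  al_total = c(12, 14, 14, 14, 15)
)

# Chance per year that both pennant winners come from the former Confederacy.
eras$p_both <- (eras$nl_conf / eras$nl_total) * (eras$al_conf / eras$al_total)

# Expected number of such World Series over all the eras.
expected <- sum(eras$years * eras$p_both)
print(round(expected, 3))   # 0.978
```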

Which countries are better at the Winter Olympics than the Summer Olympics?

From Reddit (posted by u/RoadyHouse): Which Olympic Games are these European countries the best?

This is a map which shades countries:

  • blue if they’ve won more gold medals in the Winter Olympics than the Summer Olympics
  • yellow if they’ve won more gold medals in the Summer Olympics than the Winter Olympics
  • red if they’ve won no gold medals

The only blue countries are Norway, Austria, Switzerland, and Liechtenstein. These certainly seem like a wintry set of countries (one of the big tourist attractions in Oslo is the ski jumping hill) but surely, say, Sweden should be on here? Or the Dutch with the speed skating? Or Canada? Do they even have summer there?

(A side note about that ski jumping hill – you can take the subway to it. But then you have to climb up a hill to get there! This is obvious in retrospect – of course the ski jumping hill would be on a hill! – but it was still exhausting. Also, they don’t really explain why ski jumping is a thing. I assume it involves young men and alcohol.)

The answer is that there are just a lot more events in the Summer Olympics than the Winter Olympics (and the Summer Olympics have been going on longer). So there have been 5,121 gold medals awarded all-time in the Summer Olympics but only 1,062 in the Winter Olympics, according to the all-time medal table at Wikipedia.

Consider for example Sweden. They’ve won 148 summer gold medals out of the total of 5,121, or 2.89% of all summer gold medals.

They’ve won 57 out of the 1,062 winter gold medals, or 5.37% of all winter gold medals.

So it’s reasonable to say that Sweden is better at the winter Olympics than the summer Olympics. If you wanted to put a number on it, 5.37%/2.89% = 186% so you could say they’re 86% better at winter than summer.
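The same calculation as a short R sketch, using only the Sweden numbers and medal totals quoted above (the variable names are mine). The ratio comes out a little above 1.85.

```r
summer_total  <- 5121   # all summer gold medals awarded
winter_total  <- 1062   # all winter gold medals awarded
sweden_summer <- 148
sweden_winter <- 57

# Sweden's share of each pool of gold medals.
summer_share <- sweden_summer / summer_total
winter_share <- sweden_winter / winter_total

# A ratio above 1 means relatively better at winter than summer.
ratio <- winter_share / summer_share

print(round(100 * c(summer = summer_share, winter = winter_share), 2))
print(round(ratio, 3))
```

Running the same three lines for every country in the medal table and keeping the ones with a ratio above 1 would give the kind of list that follows.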

If I’ve done it right, the list of countries that are better at the winter Olympics than the summer Olympics is, in order from most: Liechtenstein (their only gold medals ever are in winter), Austria, Norway, Switzerland, Canada, Belarus, Czech Republic, Netherlands, Germany, Finland, Estonia, Sweden, South Korea, Russia, Slovakia, Croatia, East Germany, Slovenia, Latvia.

Do you like maps? Here that is as a map.

I’m not surprised that this list is so Eurocentric. The Soviet Union, West Germany, Italy, and France just barely miss it. (I haven’t made any effort to merge together the various Germanies, or deal with the Soviet Union and its various successor states.) Many Winter Olympic sports have a high barrier to entry just in terms of what facilities are available – to take an extreme example there are only fifteen luge tracks in the world – and lots of countries just don’t have enough winter to have winter sports. So this is essentially a map of rich, cold countries. As one reporter put it during the 2018 Olympics, “From a sports perspective, Norway is rich as shit”.

Mountains help too – Denmark has 48 summer gold medalists but no winter gold medalists. Maybe the Danes should take up speed skating like the Dutch.

(This post originated as a Reddit comment. Map made using mapchart.net.)