# Rolling the dice

From the March 11, 2022 “Riddler”:
We’re playing a game where you have to pick four whole numbers. Then I will roll four fair dice. If any two of the dice add up to any one of the numbers you picked, then you win! Otherwise, you lose.

For example, suppose you picked the numbers 2, 3, 4 and 12, and the four dice came up 1, 2, 4 and 5. Then you’d win, because two of the dice (1 and 2) add up to at least one of the numbers you picked (3).

To maximize your chances of winning, which four numbers should you pick? And what are your chances of winning?

Some first thoughts:

• You want numbers that are common as the sums of two dice – middling numbers, numbers near seven.
• The problem has a reflection symmetry. The dice values $x_1, x_2, x_3, x_4$ win with the target sums $y_1, y_2, y_3, y_4$ if and only if the dice values $7-x_1, 7-x_2, 7-x_3, 7-x_4$ win with the target sums $14-y_1, 14-y_2, 14-y_3, 14-y_4$.

Putting these together, a symmetric set of middling numbers seems likely to be the best target set – something like 5, 6, 8, 9. This is a nasty case analysis, but it’s easy to do by brute force in R.

library(tidyverse)

dice_and_targets = expand.grid(d1 = 1:6, d2 = 1:6, d3 = 1:6, d4 = 1:6,
t1 = 2:12, t2 = 2:12, t3 = 2:12, t4 = 2:12) %>% filter(t1 < t2 & t2 < t3 & t3 < t4)  %>%
mutate(s12 = d1 + d2, s13 = d1 + d3, s14 = d1 + d4,
s23 = d2 + d3, s24 = d2 + d4, s34 = d3 + d4)

The data frame dice_and_targets has a row for every possible combination of dice results (d1 … d4) and targets (t1 … t4), and the sums of the dice (s12 … s34). It’s a big data frame, with $6^4 \times {11 \choose 4} = 1296 \times 330 = 427680$ rows, one for each of the 1296 possible dice rolls and 330 choices of targets.

Let’s take a look at a sample of this data frame, consisting of 10 randomly selected rows:

set.seed(1)
dice_and_targets$idx = sample(nrow(dice_and_targets)) dice_and_targets %>% filter(idx <= 10) %>% select(-idx) d1 d2 d3 d4 t1 t2 t3 t4 s12 s13 s14 s23 s24 s34 1 1 2 3 5 2 3 5 8 3 4 6 5 7 8 2 4 1 6 5 2 3 4 9 5 10 9 7 6 11 3 2 3 4 5 3 6 7 9 5 6 7 7 8 9 4 2 1 5 6 3 5 8 9 3 7 8 6 7 11 5 1 6 4 6 2 6 9 10 7 5 7 10 12 10 6 3 4 4 6 3 6 7 11 7 7 9 8 10 10 7 2 2 4 2 4 8 10 11 4 6 4 6 4 6 8 1 2 3 5 2 3 4 12 3 4 6 5 7 8 9 2 6 6 1 2 6 8 12 8 8 3 12 7 7 10 1 6 3 6 6 8 11 12 7 4 7 9 12 9 Consider, for example, the first row. In this case we roll 1, 2, 3, and 5; the targets are 2, 3, 5, and 8; the pairwise sums are 1+2 = 3, 1+3 = 4, 1+5 = 6, 2+3 = 5, 2+5 = 7, and 2+6 = 8; and we win the game, in fact three times over, since the pairwise sums include three of the targets, namely 3, 5, and 8. Next we can work out which rows win, leveraging some bitwise operations because how often do I get a chance to use these? dice_and_targets = dice_and_targets %>% mutate(target_bits = bitwOr(bitwOr(2^t1, 2^t2), bitwOr(2^t3, 2^t4))) dice_and_targets = dice_and_targets %>% mutate(sum_bits = bitwOr(bitwOr(bitwOr(2^s12, 2^s13), bitwOr(2^s14, 2^s23)), bitwOr(2^s24, 2^s34))) dice_and_targets = dice_and_targets %>% mutate(win = bitwAnd(target_bits, sum_bits) > 0) In this case target_bits has the bit corresponding to $2^t$ set if $t$ is one of the targets; sum_bits has the bit corresponding to $2^s$ set if $s$ is one of the pairwise sums. Then bitwAnd(target_bits, sum_bits) has a nonzero bit if and only if we have a winning combination. Let’s look at those randomly selected rows, now with the wins figured out: dice_and_targets %>% filter(idx <= 10) %>% select(-idx) d1 d2 d3 d4 t1 t2 t3 t4 s12 s13 s14 s23 s24 s34 target_bits sum_bits win 1 1 2 3 5 2 3 5 8 3 4 6 5 7 8 300 504 TRUE 2 4 1 6 5 2 3 4 9 5 10 9 7 6 11 540 3808 TRUE 3 2 3 4 5 3 6 7 9 5 6 7 7 8 9 712 992 TRUE 4 2 1 5 6 3 5 8 9 3 7 8 6 7 11 808 2504 TRUE 5 1 6 4 6 2 6 9 10 7 5 7 10 12 10 1604 5280 TRUE 6 3 4 4 6 3 6 7 11 7 7 9 8 10 10 2248 1920 TRUE 7 2 2 4 2 4 8 10 11 4 6 4 6 4 6 3344 80 TRUE 8 1 2 3 5 2 3 4 12 3 4 6 5 7 8 4124 504 TRUE 9 2 6 6 1 2 6 8 12 8 8 3 12 7 7 4420 4488 TRUE 10 1 6 3 6 6 8 11 12 7 4 7 9 12 9 6464 4752 TRUE  In the first row, here, target_bits is $2^8+2^5+2^3+2^2 = 100101100_2 = 300$ and sum_bits is $2^8+2^7+2^6+2^5+2^4+2^3 = 111111000_2 = 504$. And bitwAnd(target_bits, sum_bits) (not shown) is $2^8 + 2^5 + 2^3 = 100101000_2 = 296$. Since it’s greater than zero, that counts as a win. You might get the idea that it’s impossible to lose from this sample. We got a little lucky here: mean(dice_and_targets$win) returns 0.874018. If you pick a random roll of four dice and four random targets out of 2, 3, …, 12, one of the pairwise sums will be in the target set 87% of the time.

But we want to know which target set makes a win most likely.

win_counts_by_target = dice_and_targets %>% group_by(t1, t2, t3, t4) %>% summarize(wins = sum(win)) %>% arrange(desc(wins))

# A tibble: 6 x 5
# Groups:   t1, t2, t3 [5]
t1    t2    t3    t4  wins
<int> <int> <int> <int> <int>
1     4     6     8    10  1264
2     2     6     8    10  1246
3     4     6     8    12  1246
4     4     6     7     9  1238
5     5     7     8    10  1238
6     4     7     8     9  1236


There we go! And not a loop in sight.

Once I have the 4, 6, 8, 10 target set it’s easy to come up with that number 1264. Consider just the dice that show even numbers – you win if there are at least two of these and they’re not all showing 6. Similarly you win if there are at least two odd dice and they’re not all showing 1. So the losing combinations are

$(1, 1, 1, 1), (1, 1, 1, 2), (1, 1, 1, 4), (1, 1, 1, 6), (1, 1, 6, 6), (1, 6, 6, 6), (3, 6, 6, 6), (5, 6, 6, 6), (6, 6, 6, 6)$

and their rearrangements, of which there are 32.

Incidentally, my initial guess (5, 6, 8, 9) wasn’t bad – it wins 1228 times out of 1296, good enough for 14th place out of the 330 possible target sets. And the very worst target set? It’s (2, 3, 11, 12). No surprise there, although even that one wins 776 times out of 1296, nearly 60% of the time.

# Expected number of all-Confederate World Series

The World Series starts today. Atlanta vs. Houston. This is wrong for multiple reasons:

• the Astros are cheaters
• the Astros are a National League team
• as a native Philadelphian, I’m obligated to hate the Braves, even though I moved to Atlanta
• these are warm-weather teams and part of the fun of the ridiculously late postseason is that it’s not really baseball weather, but I just went for a walk and it’s pretty nice out.

Nathaniel Rakich observed that this is the first-ever World Series between teams from the former Confederacy. This surprised me! But there are only five teams in the former Confederacy, out of 30 in MLB (29 of which are in the US). In chronological order of formation, they are

• the Houston Astros (NL 1962-2012, AL 2013-present)
• the Atlanta Cobb County Braves (NL 1966-present, moved from Boston)
• the Texas (Dallas-area) Rangers (AL 1972-present, moved from Washington)
• the Florida/Miami Marlins (NL 1993-present)
• the Tampa Bay Devil Rays (AL 1998-present)

In particular Missouri never seceded, which matters quite a bit here because the St. Louis Cardinals have been in the World Series the second-most of any team.

First, a few words about Major League Baseball. There are currently two “leagues” comprising MLB, the National League and the American League. Each has 15 teams, of which one (the “pennant winner”) will make it to the World Series.

Organized baseball got started in the late 19th century, and its “classic” alignment of 16 teams were all in northern cities, since there were few large southern cities at the time. From 1903 to 1952 the teams were located as follows: Boston x2, Brooklyn, Chicago x2, Cleveland, Cincinnati, Detroit, New York x2, Philadelphia x2, Pittsburgh, St. Louis x2, Washington. In 1953-1972 a bunch of teams moved but since then MLB has mostly grown via expansion.

The former Confederacy is still is underrepresented in MLB – it has population of about 108 million, compared to the US population of 331 million, so it “ought” to have nine or ten teams. Or, if you’re going to argue that an MLB team has to be in a big city, nine of the thirty largest metropolitan areas are in the former Confederacy. (In order, they’re Dallas, Houston, Miami, Atlanta, Tampa, Orlando, Charlotte, San Antonio, and Austin. The first five have teams, and I believe the latter four have been thrown around as expansion candidates.) So although historically the country’s big cities may have been in the north, this is less true now.

Given the historical locations of the teams, how many all-Confederate World Series would we expect? We start counting in 1972, when the American League got its first team in the former Confederacy.

So for example, in each of 1972-76, the chances of both pennant winners coming from the former Confederacy were 2/12 x 1/12 = 2/144. With the current alignment it’s 2/15 x 3/15 = 6/225.

The expected number of all-Confederate World Series is

5 x 2/12 x 1/12 + 16 x 2/12 x 1/14 + 5 x 3/14 x 1/14 + 15 x 3/16 x 2/14 + 9 x 2/15 x 3/15 = 0.978

which is honestly lower than I expected! But it’s only fairly recently that there have been an appreciable number of MLB teams in this part of the country, and the fact that you need teams from both leagues to get through really keeps this number down.

# Which countries are better at the Winter Olympics than the Summer Olympics?

From Reddit (posted by u/RoadyHouse): Which Olympic Games are these European countries the best?

This is a map which shades countries:

• blue if they’ve won more gold medals in the Winter Olympics than the Summer Olympics
• yellow if they’ve won more gold medals in the Summer Olympics than the Winter Olympics
• red if they’ve won no gold medals

The only blue countries are Norway, Austria, Switzerland, and Liechtenstein. These certainly seem like a wintry set of countries (one of the big tourist attractions in Oslo is the ski jumping hill) but surely, say, Sweden should be on here? Or the Dutch with the speed skating? Or Canada? Do they even have summer there?

(A side note about that ski jumping hill – you can take the subway to it. But then you have to climb up a hill to get there! This is obvious in retrospect – of course the ski jumping hill would be on a hill! – but it was still exhausting. Also, they don’t really explain why ski jumping is a thing. I assume it involves young men and alcohol.)

The answer is that there are just a lot more events in the Summer Olympics than the Winter Olympics (and the Summer Olympics have been going on longer). So there have been 5,121 gold medals awarded all-time in the Summer Olympics but only 1,062 in the Winter Olympics, according to the all-time medal table at Wikipedia.

Consider for example Sweden. They’ve won 148 summer gold medals out of the total of 5,121, or 2.89% of all summer gold medals.

They’ve won 57 out of the 1,062 winter gold medals, or 5.36% of all winter gold medals.

So it’s reasonable to say that Sweden is better at the winter Olympics than the summer Olympics. If you wanted to put a number on it, 5.36%/2.89% = 185% so you could say they’re 85% better at winter than summer.

If I’ve done it right, the list of countries that are better at the winter Olympics than the summer Olympics are, in order from most: Liechtenstein (their only gold medals ever are in winter), Austria, Norway, Switzerland, Canada, Belarus, Czech Republic, Netherlands, Germany, Finland, Estonia, Sweden, South Korea, Russia, Slovakia, Croatia, East Germany, Slovenia, Latvia.

Do you like maps? Here that is as a map.

I’m not surprised that this list is so Eurocentric. The Soviet Union, West Germany, Italy, and France just barely miss it. (I haven’t made any effort to merge together the various Germanies, or deal with the Soviet Union and its various successor states.). Many Winter Olympic sports have a high barrier to entry just in terms of what facilities are available – to take an extreme example there are only fifteen luge tracks in the world – and lots of countries just don’t have enough winter to have winter sports. So this is essentially a map of rich, cold countries. As one reporter put it during the 2018 Olympics, “From a sports perspective, Norway is rich as shit“.

Mountains help too – Denmark has 48 summer gold medalists but no winter gold medalists. Maybe the Danes should take up speed skating like the Dutch.

(This post originated as a Reddit comment. Map made using mapchart.net.)

# Products equaling sums

From The Riddler, June 25, 2021, a puzzle from Matt Enlow (which was originally published in Math Horizons, although I can’t find the original):

A bag contains 100 marbles, and each marble is one of three different colors. If you were to draw three marbles at random, the probability that you would get one of each color is exactly 20 percent.

How many marbles of each color are in the bag?

It doesn’t seem like there’s enough information, but there is. I recommend not trying to solve this puzzle in your head while navigating the day care parking lot, though.

Let’s say the numbers of the three colors are x, y and z. Then the number of sets of three marbles which contain one of each color is just xyz. And the number of total sets of three marbles is ${100 \choose 3}$. So we need to find positive integers x, y, z such that x + y + z = 100 and $5xyz = {100 \choose 3}.$

It’ll help to find the prime factorization of ${100 \choose 3}$, so we expand it:

${100 \choose 3} = {100 \times 99 \times 98 \over 3 \times 2 \times 1} = {2^2 \times 5^2 \times 3^2 \times 11 \times 2 \times 7^2 \over 3 \times 2} = 2^2 \times 3 \times 5^2 \times 7^2 \times 11.$

Dividing out the 5, we need $xyz = 2^2 \times 3 \times 5 \times 7^2 \times 11.$. The three big factors – the two 7s and the 11 – look like they’ll be worrisome. So let’s spread them out: let x and y both be multiples of 7, and let z be a multiple of 11. To get them to add up to 100, we need to find a multiple of 7 and a multiple of 11 that add up to 100; it’s easy to find 44 and 56. So z = 44.

That leaves $xy = 3 \times 5 \times 7^2$, so x and y must be 21 and 35.

Alternately, we can write some code to solve it, but what fun is that?

n = 100

target = (n * (n-1) * (n-2))/6
for x in range(1, n+1):
for y in range(x, int((n-x)/2)+1):
z = n-x-y;
if target == 5*x*y*z:
print(x, y, z)


which prints out

21 35 44

as expected.

But now that we have some code, is there something special about 100? Let’s iterate over the first 1000 integers:

for n in range(1, 1000):
target = (n * (n-1) * (n-2))/6
for x in range(1, n+1):
for y in range(x, int((n-x)/2)+1):
z = n-x-y;
if target == 5*x*y*z:
print(n, x, y, z)


returns

6 1 1 4
10 2 2 6
22 4 7 11
26 5 8 13
27 5 9 13
35 7 11 17
36 7 12 17
40 8 13 19
46 11 12 23
56 11 21 24
65 13 24 28
76 19 20 37
100 21 35 44
126 31 35 60
330 82 94 154
345 86 98 161
352 78 120 154
352 91 96 165
406 101 116 189
436 109 124 203
497 124 142 231
512 128 146 238
737 161 268 308


The sequence 6, 10, 22, 26, 27, 35… isn’t in the OEIS as of this writing, although some other Riddler puzzles have made it; it may be there later. These are all numbers where not just n has a lot of factors, but also n-1 and n-2 – for example 345 = 3 x 5 x 23, 344 = 2 x 2 x 2 x 43, 343 = 7 x 7 x 7. Thus ${n \choose 3}/5$ will have many factors, and many ways to be written as a product of three integers, making it more likely that one of those will have the right sum.

I had been expecting an algebraically cleaner solution, because this reminded me of another puzzle: a bag has r red and b blue balls in it. Find r, b such that the probability that two balls picked at random have different colors is exactly one-half. In this case similar code returns the solution

4 1 3
9 3 6
16 6 10
25 10 15
36 15 21
49 21 28
64 28 36
81 36 45
100 45 55

from which we can guess that the solutions are $r = (n^2-n)/2, b = (n^2+n)/2$.

# Two dice problems

Nick Berry asks: you roll a pair of six-sided dice and sum that to get a total. Your good friend does the same. What are the chances that you both get the same total?

My instinct is to say that it should be somewhere between 1/6 and 1/11 – you’d get 1/6 if the question were about one die and 1/11 if it were about picking uniformly a number from 2, 3, …, 12.

Let’s generalize. What if you roll two n-sided dice? In that case, the probabilities of getting 2, 3, …, n+1 are $1/n^2, 2/n^2, \ldots, n/n^2$, and the probabilities of getting n+2, n+3, …, 2n are $(n-1)/n^2, (n-2)/n^2, \cdots, 1/n^2$ – generalizing the well-known triangular distribution. The probability that you and your friend both roll 2, then, is

$\displaystyle \left( {1 \over n^2} \right)^2 + \left( {2 \over n^2} \right)^2 + \cdots + \left( {n-1 \over n^2} \right)^2 + \left( {n \over n^2} \right)^2 + \left( {n-1 \over n^2} \right)^2 + \cdots + \left( {1 \over n^2} \right)^2$

or, putting everything over a common denominator,

$\displaystyle {1^2 + 2^2 + \cdots + (n-1)^2 + n^2 + (n-1)^2 + \cdots + 1^2 \over n^4}$

At this point we can already see that there are order-of-n terms in the numerator, of order $n^2$, so we should get a result of order 1/n. We can peservere and do the algebra and we get

$\displaystyle {2n^3 + n \over 3n^4}$

or just a hair over 2/3n.

Berry also asks what happens if you have more dice. For k six-sided dice he gives the results of a Monte Carlo simulation which has roughly $1/{6\sqrt{k}}$. He writes that “the percentage chance of a match score falls off a lot slower than I would have predicted.”

Why does it fall off so slowly? The mean result from a single die is 7/2, with variance 35/12. So the mean result from k dice is 7k/2, with variance 35k/12 – so only values within a few multiples of $\sqrt{35k/12}$ are possible with any appreciable frequency. There are on the order of a few square roots of k of these, so the answer should be $1/\sqrt{Ck}$ for some $k$. Like Berry, I am too lazy to find the constant $C$.

Edited to add: Matthew Aldridge gives in the comments an argument (for approximating that for $k$ six-sided dice, with $k$ large, the probability of both getting the same total is approximately $1/\sqrt{pi \times 35k/3}$, by approximating both sums as normal and applying the central limit theorem. This is about $1/(6.054 \sqrt {k})$. If we bear in mind that the variance of a single roll of an n-sided die is $(n^2-1)/12$, then we get $1/\sqrt{pi/3 \times (n^2-1)k}$ for rolling k n-sided dice.

This approximation is surprisingly good even for k = 1, where it has no business being good. It gives $1/\sqrt{pi/3 \times (n^2-1)}$. This reduces to 1/n, the correct answer, if you let $\pi = 3$ and ignore the -1.

# Rational approximations of logs of small primes

Let’s say for some reason we need to approximate decimal logarithms of some small primes. We know $2^{10} \approx 10^3$, so clearly $log 2 \approx 3/10$. (All logarithms in this post are to base 10.) Can we approximate the logs of any other primes this way?

When I run through powers of small primes, the target that jumps out at me is

$7^4 = 2401 ~ 2400 = 2^5 3^1 5^2$.

Taking logs of both sides gives

$4 L_7 \approx 5 L_2 + L_3 + 2 L_5$

where $L_x = \log_{10} x$. And of course we have $L_2 + L_5 = 1$, so this relation becomes

$4L_7 \approx 2 + 3 L_2 + L_3$.

Are there other such relations? It turns out that there are finitely many numbers n such that n and n+1 have all their prime factors < 7 – apparently a conseuqence of the abc conjecture – so I just take the largest ones and derive relations like these. So I can derive similar relations from $\log 224 \approx \log 225$ and $\log 4374 \approx \log 4375$, namely

$7 L_2 + L_7 \approx 2 L_3 + 2$

and

$5 L_2 + 7 L_3 \approx 4 + L_7$

This is a system of three (approximate) equations in three unknowns, and solving it in the usual way gives

$L_2 = 72/239 \approx 0.30125, L_3 = 114/239 \approx 0.47699, L_7 = 202/239 \approx 0.84519$.

while the true values are 0.30103, 0.47712, and 0.84510 – so this is good enough for three-place accuracy.

In practice, who wants to have a rational with 239 in the denominator as an approximation? More practical is the following:

• remember $2^{10} \approx 10^3$ (from familiar powers of two), or, taking 120th roots, $2^{1/12} \approx 10^{1/40}$.
• remember $2^{19} \approx 3^{12}$ (a fundamental fact of musical temperament).

That gives $\log 2 \approx 3/10$ right away, $\log 3 \approx 19/12 \log 2 = 19/40$, and then $\log 2400 = 2 + 3 \log 2 + \log 3 = 3.375$ so $\log 7 \approx 3.375/4 = 0.84375$. This is essentially I. J. Good’s singing logarithms, which exploits the fact that a lot of rational numbers with small numerator and denominator can be approximated as powers of $2^{1/12}$, the fact on which our usual musical tuning system is based: $2^{7/12} \approx 3/2, 2^{4/12} \approx 5/4$, and a bit less accurately $2^{10/12} \approx 7/4$. (Musically these are the facts that the perfect fifth, major third, and minor seventh are the ratios 3/2, 5/4, and 7/4.).

From the last we can derive $2^{34/12} \approx 7$, which gives $\log 7 \approx 34/40 = 0.85$. This is a less good approximation than those of $\log 3$ and $\log 5$, just as the seventh harmonic isn’t really a note in the equal-tempered scale – the harmonic or “barbershop” seventh is noticeably flat compared to the minor seventh.

# Objects in my house with icosahedral symmetry

Somehow toys just show up in this house, a phenomenon which I think is familiar to many parents. Some of them have icosahedral symmetry. Like this one.

You can see that the designer was outlining the vertex of an icosahedron with that five-pointed star shape, but they couldn’t quite commit – the star points don’t actually point to the next vertex! (And no, they don’t turn.) You can get a better sense of the stars corresponding to the vertices of an icosahedron here:

There’s also this one, which is softer and has a couple of distinguished vertices antipodal to each other:

And this one which I bought for myself years ago, presumably in some sort of store that sold housewares. (Remember stores?)

The nice thing about this one is that you can see through it, which makes for some interesting photographic possibilities, such as this view where two antipodal vertices are aligned:

and this view with that emphasizes a threefold rotational symmetry:

Of course we have a soccer ball somewhere. You know what a soccer ball looks like, I’m not taking a picture.

Finally, the Twitter feed for this blog has as its icon an icosahedron, which I got from Wikimedia Commons:

I also use this as an avatar in various work systems that need one – these generally require small pictures and a face wouldn’t show up well, and it’s easier to pick out than the default in a lot of these systems which is just someone’s initials in a circle.

I do not yet have an icosahedron as a tattoo, but I’ve liked it for a while. If I were to get a tattoo it would be either an icosahedron or the diagram from Byrne’s rendition of Euclid’s proof of the Pythagorean theorem. (The link goes to Nicolas Rougeux’s interactive enhancement of the same.)

# STOP POTS POST. SPOT OPTS TOPS.

I had thought, for a few decades now, that STOP was the four-letter word with the most anagrams, with six: STOP itself, POST, POTS, TOPS, OPTS, SPOT. So of course when Josh Millard put out these STOP permutations signs, I had to buy one. It’s a limited edition of 24 stop sign prints, one for each permutation. (I opted for OPTS. As of this writing there are 18 still available; the 6 that have been bought are the five anagrams of STOP other than STOP itself, and SOTP.)

But then I had to check that claim. Peter Norvig has, meant to accompany a chapter on NLP, some word lists, of which I’ve used the enable1.txt list before for word puzzles. (I’m not sure who compiled this list.) We can put words into a canonical form by alphabetizing the letters – for example michael becomes acehilm, and stop becomes opst. Scrabble players call this an alphagram. Then to find the four-letter word with the most anagrams is just a matter of counting.

library(tidyverse)
words = read_csv(url('https://norvig.com/ngrams/enable1.txt'), col_names = FALSE)
colnames(words) = 'word'
alphabetize_word = function(w){paste(sort(strsplit(w, '')[[1]]), collapse = '')}
words$alphagram = sapply(words$word, alphabetize_word)
words$len = nchar(words$alphagram)
alphagram_counts = words %>% group_by(alphagram, len) %>%
summarize(n = n(), anagrams = paste0(word, collapse = ', '))
alphagram_counts %>% filter(len == 4) %>% arrange(desc(n))


And here are the four-letter words with the most anagrams:

alphagram_counts %>% filter(len == 4) %>% select(-len) %>% arrange(desc(n))

# A tibble: 2,655 x 3
alphagram     n anagrams
<chr>     <int> <chr>
1 aest          8 ates, east, eats, etas, sate, seat, seta, teas
2 aers          7 ares, arse, ears, eras, rase, sear, sera
3 ailr          7 aril, lair, lari, liar, lira, rail, rial
4 astw          6 staw, swat, taws, twas, wast, wats
5 opst          6 opts, post, pots, spot, stop, tops
6 ostw          6 stow, swot, tows, twos, wost, wots
7 aeht          5 eath, haet, hate, heat, thae
8 aels          5 ales, lase, leas, sale, seal
9 aelt          5 late, tael, tale, teal, tela
10 aelv          5 lave, leva, vale, veal, vela
# … with 2,645 more rows

No! Child me was wrong!

But wait! What is “seta”? Is “ates” really a thing – you can’t pluralize a verb like that! (“ate” appears to be Tagalog for “older sister”.) Perhaps the aers set, with seven anagrams, wins, but “sera” is technical (plural of serum), and as an American I have trouble recognizing “rase” as a legitimate spelling of “raze”. “lari” is a unit of money in Georgia (Tbilisi, not Atlanta) which I was unfamiliar with. And so on.

Fortunately Norvig also has a list of word frequencies (count_1w.txt), of the 332,202 most common words in a trillion-word corpus. (One of the perks of working at Google, I assume.) So we can read that in.

freqs = read_delim(url('https://norvig.com/ngrams/count_1w.txt'),delim = '\t', col_names = FALSE)
colnames(freqs) = c('word', 'freq')


The most common words are the ones you’d expect. (2.3% of words are “the”.)

> head(freqs)
# A tibble: 6 x 2
word         freq
<chr>       <dbl>
1 the   23135851162
2 of    13151942776
3 and   12997637966
4 to    12136980858
5 a      9081174698
6 in     8469404971

And the least common words are… barely words. (I don’t know the full story behind this dataset.) So it seems reasonable that all “real” words will be here.

> tail(freqs)
# A tibble: 6 x 2
word     freq
<chr>   <dbl>
1 goofel  12711
2 gooek   12711
3 gooddg  12711
4 gooblle 12711
5 gollgo  12711
6 golgw   12711

Now we can attach frequencies to the words. There are too many words in the sets for a table to be nice, so we switch to plots.

words %>% left_join(alphagram_counts)  %>%
filter(len ==  4 & n >= 6)  %>%
left_join(freqs) %>% arrange(alphagram, desc(freq)) %>%
select(alphagram,  word, freq) %>% group_by(alphagram) %>%
mutate(rk = rank(desc(freq))) %>%
ggplot() + geom_line(aes(x=rk,  y=log(freq/10^12, 10), group = alphagram, color = alphagram)) +
scale_x_continuous('rank within alphagram set', breaks  = 1:8, minor_breaks = c()) +
scale_y_continuous('log_10 of word frequency', breaks = -8:-3, minor_breaks  = c()) +
theme_minimal() + geom_text(aes(x=rk, y=log(freq/10^12, 10), color = alphagram, label = word)) +
ggtitle('Frequency of four-letter words with six or more anagrams')


And if we plot the frequency of each word against its rank in its own anagram set

then we can see that the STOP set consists of much more common words than any of the others. (STOP isn’t even the most common of its own anagrams, which surprises me – that honor goes to POST. But when I was a small child STOP seemed much more common, because of the signs.) I’m surprised to see SERA so high; this is either an extremely technical corpus or (more likely) contamination from Spanish.

And here’s a similar plot for five letters. Here I’d thought the word with the most anagrams was LEAST (among “common” words, 6: TALES, STEAL, SLATE, TESLA, STALE) but it looks like SPARE wins with room to spare, even if you don’t buy that APRES is an English word.

# How many borders between states are there in the United States?

How many borders between states are there in the United States?

Sure, you could get out a map and count them. Or you could estimate.

There are 48 contiguous states. The average state has six borders [citation needed], so that’s 288 borders, but we double-counted, so that’s 144. But we need to apply a bit of a haircut for those states that are around the edge. How many of those are there? Figure the US is roughly a 5-by-10 rectangle of states, so there are 30 states around the edge. 144 minus 30 is 114.

There are actually 109. In 1998 Thomas Holmes constructed a data set of those borders for a paper, The Effect of State Policies on the Location of Industry: Evidence from State Borders. I haven’t read the paper. It appears that it shows that there was more manufacturing activity on the “pro-business” (anti-union, has so-called right-to-work laws) side of a state border than on the “anti-business” (pro-union, doesn’t have so-called right-to-work laws) of the state border.

This method could probably also be applied with, say, mask mandates and COVID case rates. Early on in the pandemic there was some coverage of how Tennessee was doing much worse than Kentucky, although that may have been overly politicized (Kentucky has a Democratic governor, Tennessee a Republican) and may have been due to higher testing rates in Tennessee. (See Andrew Gelman’s post on the topic; it appears that data on deaths didn’t show the same gap.)

Some people like counting the borders they’ve crossed, as in this post at Twelve Mile Circle. That post includes a map by Jon Persky that gives 138 borders, but that includes 16 land crossings between contiguous US states and Canadian provinces; 8 between US states and Mexican states; two between Alaska and Canadian provinces; and three borders that can only be crossed by water (Maine – Nova Scotia, New York – Rhode Island, and Ohio – Ontario).

As for that fact that “the average state has six borders”, this is really a statement about planar graphs. From the map of the US, construct a planar graph by taking the 48 states as vertices and the state borders as edges. (You have a problem at Four Corners, which we’ll ignore.) Let E be the number of edges in the graph, and F its number of faces. Here a “face” corresponds to a place where three states meet, such as Pennsylvania-Maryland-Delaware or Georgia-Alabama-Tennessee. Then every edge meets two faces and, except for around the perimeter of the graph, every face has three edges, and thus $2E \approx 3F$. Euler proved $V - E + F = 2$, which we’ll approximate as $V - E + F \approx 0$. Thus $V - E + 2E/3 \approx 0$, or rearranging $E \approx 3V$; the number of edges (state borders) is about three times the number of states.

This is all brought to you by Colin Beveridge’s kids asking the same question about national borders.

# Riddler: a grid of numbers

From FiveThirtyEight’s “Riddler” feature: fill in this table with single-digit numbers, such that the entries in the margins are the products of the corresponding entries in the interior of the table.

We start by factorizing the column products, to get $2^6 3^4 5^1 7^3$, $2^7 5^2 7^2$, and $2^1 3^4 7^3$ respectively. Since the second-column product isn’t divisible by 3, the second column must consist of only 1, 2, 4, 5, 7, and 8. The third column isn’t divisible by 4 or 5 so it can’t contain 4, 5, or 8; furthermore it contains only a single even number (2 or 6).

We can explicitly enumerate the possibilities for each row. For example for the row with product 84 we have

expand.grid(1:9, 1:9, 1:9) %>% filter(8890560 %% Var1 == 0 & 156800 %% Var2 ==0 & 55566 %% Var3 == 0) %>%
mutate(prod = Var1 * Var2 * Var3) %>% filter(prod == 84)

which returns the data frame

  Var1 Var2 Var3 prod
1    6    7    2   84
2    7    4    3   84
3    4    7    3   84
4    7    2    6   84
5    2    7    6   84
6    6    2    7   84
7    3    4    7   84

and so there are seven possibilities for this row. 7, 6, 2 doesn’t appear because the second column can’t contain a multiple of three.

This is actually enough to fill in a few entries, and we also can list all the possibilities for the remaining ones:

Now the first column has product $2^6 3^4 5^1 7^3$, so it must have a single 5 and three 7s. The remaining four entries have to multiply to $2^6 3^4$ so they must be two 9s and two 8s. That lets us complete the first column, because there is only two possible locations for a 9, two for an 8, and one for a 5. And knowing those first-column values allows us to complete various rows:

The third column only contains a single even number, which is enough to finish that column, and then work out the second column by arithmetic:

I was honestly surprised this puzzle was solvable – I didn’t believe there was enough information at first. I think it works out because the first-column product 8890560 is large enough that we can determine uniquely what the values in the column are and only have to put them in order; the third-column only having one even value works as well.

Also, I believe this puzzle was part of the MIT Mystery Hunt that took place this weekend (which I haven’t competed in in a Very Long Time). The Riddler column was named “Can You Hunt For The Mysterious Numbers?” and it says the puzzle was by Barbara Yew, and googling that “name” finds an MIT web page at yewlabs.mit.edu for something called “MYST2021: Maturing Young Scientific Theories: Expanding Reality & You” – the first letters spell “MYSTERY”.