A bag contains 100 marbles, and each marble is one of three different colors. If you were to draw three marbles at random, the probability that you would get one of each color is *exactly* 20 percent.

How many marbles of each color are in the bag?

It doesn’t seem like there’s enough information, but there is. I recommend not trying to solve this puzzle in your head while navigating the day care parking lot, though.

Let’s say the numbers of the three colors are *x*, *y* and *z*. Then the number of sets of three marbles which contain one of each color is just *xyz*, and the total number of sets of three marbles is $\binom{100}{3}$. So we need to find positive integers *x*, *y*, *z* such that *x* + *y* + *z* = 100 and

$$xyz = \frac{1}{5}\binom{100}{3}.$$

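It’ll help to find the prime factorization of $\binom{100}{3}$, so we expand it:

$$\binom{100}{3} = \frac{100 \cdot 99 \cdot 98}{3 \cdot 2 \cdot 1} = 2^2 \cdot 3 \cdot 5^2 \cdot 7^2 \cdot 11.$$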

Dividing out the 5, we need $xyz = 2^2 \cdot 3 \cdot 5 \cdot 7^2 \cdot 11 = 32340$. The three big factors – the two 7s and the 11 – look like they’ll be worrisome. So let’s spread them out: let *x* and *y* both be multiples of 7, and let *z* be a multiple of 11. To get them to add up to 100, we need to find a multiple of 7 and a multiple of 11 that add up to 100; it’s easy to find 44 and 56. So *z* = 44.

That leaves $x + y = 56$ and $xy = 32340/44 = 735 = 3 \cdot 5 \cdot 7^2$, so *x* and *y* must be 21 and 35.

Alternately, we can write some code to solve it, but what fun is that?

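```
n = 100
# We need 5*x*y*z == C(n, 3); integer division keeps the arithmetic exact.
target = (n * (n-1) * (n-2)) // 6
for x in range(1, n+1):
    # Keep x <= y <= z so each solution is printed once.
    for y in range(x, (n-x)//2 + 1):
        z = n - x - y
        if target == 5*x*y*z:
            print(x, y, z)
```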

which prints out

`21 35 44`

as expected.
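As a sanity check, here is a minimal sketch that recomputes the probability exactly with rational arithmetic. The function name `one_of_each_probability` is my own, not part of the code above.

```python
from fractions import Fraction
from math import comb


def one_of_each_probability(x, y, z):
    """Exact probability that three marbles drawn without replacement
    from a bag of x + y + z marbles show one of each color."""
    return Fraction(x * y * z, comb(x + y + z, 3))


p = one_of_each_probability(21, 35, 44)
print(p)  # 1/5
```

Using `Fraction` rather than floats means "exactly 20 percent" really is tested exactly.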

But now that we have some code, is there something special about 100? Let’s iterate over the first 1000 integers:

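```
for n in range(1, 1000):
    # We need 5*x*y*z == C(n, 3); integer division keeps the arithmetic exact.
    target = (n * (n-1) * (n-2)) // 6
    for x in range(1, n+1):
        # Keep x <= y <= z so each solution is printed once.
        for y in range(x, (n-x)//2 + 1):
            z = n - x - y
            if target == 5*x*y*z:
                print(n, x, y, z)
```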

returns

```
6 1 1 4
10 2 2 6
22 4 7 11
26 5 8 13
27 5 9 13
35 7 11 17
36 7 12 17
40 8 13 19
46 11 12 23
56 11 21 24
65 13 24 28
76 19 20 37
100 21 35 44
126 31 35 60
330 82 94 154
345 86 98 161
352 78 120 154
352 91 96 165
406 101 116 189
436 109 124 203
497 124 142 231
512 128 146 238
737 161 268 308
```

The sequence 6, 10, 22, 26, 27, 35… isn’t in the OEIS as of this writing, although some other Riddler puzzles have made it; it may be there later. These are all numbers where not just n has a lot of factors, but also n-1 and n-2 – for example 345 = 3 × 5 × 23, 344 = 2 × 2 × 2 × 43, 343 = 7 × 7 × 7. Thus $\binom{n}{3}$ will have many factors, and many ways to be written as a product of three integers, making it more likely that one of those will have the right sum.
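The factorizations quoted for 345, 344 and 343 are easy to reproduce. This is a small trial-division sketch; `prime_factors` is a helper name I made up for illustration.

```python
def prime_factors(m):
    """Return the prime factorization of m >= 2 as a sorted list,
    found by trial division."""
    factors = []
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.append(d)
            m //= d
        d += 1
    if m > 1:
        # Whatever is left over is itself prime.
        factors.append(m)
    return factors


for m in (345, 344, 343):
    print(m, prime_factors(m))
```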

I had been expecting an algebraically cleaner solution, because this reminded me of another puzzle: a bag has *r* red and *b* blue balls in it. Find *r*, *b* such that the probability that two balls picked at random have different colors is exactly one-half. In this case similar code returns the solutions

```
4 1 3
9 3 6
16 6 10
25 10 15
36 15 21
49 21 28
64 28 36
81 36 45
100 45 55
```

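from which we can guess that the solutions are $r = \binom{k}{2}$, $b = \binom{k+1}{2}$, two consecutive triangular numbers, with $r + b = k^2$ balls in total.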
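The "similar code" isn't shown, so here is one guess at what it could look like. The probability of two different colors is $rb / \binom{n}{2}$, so the condition is $4rb = n(n-1)$; the function name and the convention $r \le b$ are my assumptions.

```python
def two_color_solutions(max_n):
    """Yield (n, r, b) with r <= b, r + b = n, and probability exactly 1/2
    that two balls drawn without replacement have different colors.

    P(different) = r*b / C(n, 2) = 2*r*b / (n*(n-1)), so the condition
    is 4*r*b == n*(n-1), checked in integers to avoid rounding.
    """
    for n in range(2, max_n + 1):
        for r in range(1, n // 2 + 1):
            b = n - r
            if 4 * r * b == n * (n - 1):
                yield n, r, b


for n, r, b in two_color_solutions(100):
    print(n, r, b)
```

Since $(b - r)^2 = n^2 - 4rb = n$, the total has to be a perfect square, which matches the list.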

My instinct is to say that it should be somewhere between 1/6 and 1/11 – you’d get 1/6 if the question were about one die and 1/11 if it were about picking uniformly a number from 2, 3, …, 12.

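Let’s generalize. What if you roll two *n*-sided dice? In that case, the probabilities of getting 2, 3, …, n+1 are $\frac{1}{n^2}, \frac{2}{n^2}, \ldots, \frac{n}{n^2}$, and the probabilities of getting n+2, n+3, …, 2n are $\frac{n-1}{n^2}, \frac{n-2}{n^2}, \ldots, \frac{1}{n^2}$ – generalizing the well-known triangular distribution. The probability that you and your friend both roll the same total, then, is

$$\left(\frac{1}{n^2}\right)^2 + \left(\frac{2}{n^2}\right)^2 + \cdots + \left(\frac{n}{n^2}\right)^2 + \left(\frac{n-1}{n^2}\right)^2 + \cdots + \left(\frac{1}{n^2}\right)^2$$

or, putting everything over a common denominator,

$$\frac{1^2 + 2^2 + \cdots + n^2 + (n-1)^2 + \cdots + 1^2}{n^4}.$$

At this point we can already see that there are order-of-n terms in the numerator, each of order $n^2$, so we should get a result of order 1/n. We can persevere and do the algebra and we get

$$\frac{2n^2 + 1}{3n^3},$$

or just a hair over 2/(3n).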
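If you don't trust the algebra, a brute-force check is cheap. This sketch counts the ways to make each total with two fair n-sided dice and compares the exact match probability with the closed form $(2n^2+1)/(3n^3)$; `match_probability` is a name I chose.

```python
from fractions import Fraction


def match_probability(n):
    """Exact probability that two players, each rolling two fair
    n-sided dice, get the same total."""
    # Count the ways to make each total from 2 to 2n.
    ways = {}
    for a in range(1, n + 1):
        for b in range(1, n + 1):
            ways[a + b] = ways.get(a + b, 0) + 1
    # Both players must land on the same total, so square and sum.
    return sum(Fraction(w, n * n) ** 2 for w in ways.values())


for n in (2, 6, 20):
    exact = match_probability(n)
    closed_form = Fraction(2 * n * n + 1, 3 * n ** 3)
    print(n, exact, closed_form, float(exact))
```

For ordinary dice this gives 73/648, a bit over 11 percent, which does sit between 1/11 and 1/6.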

Berry also asks what happens if you have more dice. For k six-sided dice he gives the results of a Monte Carlo simulation which has roughly . He writes that “the percentage chance of a match score falls off a lot slower than I would have predicted.”

Why does it fall off so slowly? The mean result from a single die is 7/2, with variance 35/12. So the mean result from k dice is 7k/2, with variance 35k/12 – so only values within a few multiples of $\sqrt{k}$ of the mean are possible with any appreciable frequency. There are on the order of a few square roots of k of these, so the answer should be $c/\sqrt{k}$ for some constant $c$. Like Berry, I am too lazy to find the constant $c$.

**Edited to add:** Matthew Aldridge gives in the comments an argument that for six-sided dice, with $k$ large, the probability of both getting the same total is approximately $\sqrt{3/(35\pi k)}$, by approximating both sums as normal and applying the central limit theorem. This is about $0.165/\sqrt{k}$. If we bear in mind that the variance of a single roll of an n-sided die is $(n^2-1)/12$, then we get $\sqrt{3/(\pi k (n^2-1))}$ for rolling k n-sided dice.

This approximation is surprisingly good even for k = 1, where it has no business being good. It gives $\sqrt{3/(\pi (n^2-1))}$. This reduces to 1/n, the correct answer, if you let $\pi = 3$ and ignore the -1.
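Here is a rough sketch for comparing the exact match probability with the normal approximation $\sqrt{3/(\pi k (n^2-1))}$. The exact distribution is built by repeated convolution in floating point; all function names are mine.

```python
from math import pi, sqrt


def total_distribution(k, n):
    """Distribution of the sum of k fair n-sided dice, as a list
    where index i holds the probability of total k + i."""
    dist = [1.0]
    for _ in range(k):
        new = [0.0] * (len(dist) + n - 1)
        for i, p in enumerate(dist):
            for face in range(n):
                new[i + face] += p / n
        dist = new
    return dist


def match_probability(k, n):
    """Probability that two independent rolls of k n-sided dice agree."""
    return sum(p * p for p in total_distribution(k, n))


def normal_approximation(k, n):
    """The central-limit estimate sqrt(3 / (pi * k * (n^2 - 1)))."""
    return sqrt(3 / (pi * k * (n * n - 1)))


for k in (1, 2, 5, 20):
    print(k, round(match_probability(k, 6), 4), round(normal_approximation(k, 6), 4))
```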

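When I run through powers of small primes, the target that jumps out at me is

$$7^4 = 2401 \approx 2400 = 2^5 \cdot 3 \cdot 5^2.$$

Taking logs of both sides gives

$$4 \log 7 \approx 5 \log 2 + \log 3 + 2 \log 5,$$

where $\log = \log_{10}$. And of course we have $\log 5 = 1 - \log 2$, so this relation becomes

$$3 \log 2 + \log 3 - 4 \log 7 \approx -2.$$

Are there other such relations? It turns out that there are finitely many numbers n such that n and n+1 have all their prime factors at most 7 – apparently a consequence of the abc conjecture – so I just take the largest ones and derive relations like these. So I can derive similar relations from $\log 224 \approx \log 225$ and $\log 4374 \approx \log 4375$, namely

$$7 \log 2 - 2 \log 3 + \log 7 \approx 2$$

and

$$5 \log 2 + 7 \log 3 - \log 7 \approx 4.$$

This is a system of three (approximate) equations in three unknowns, and solving it in the usual way gives

$$\log 2 \approx \frac{72}{239} \approx 0.30126, \quad \log 3 \approx \frac{114}{239} \approx 0.47699, \quad \log 7 \approx \frac{202}{239} \approx 0.84519,$$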

while the true values are 0.30103, 0.47712, and 0.84510 – so this is good enough for three-place accuracy.
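To see where a denominator like 239 can come from, here is a sketch that solves such a system exactly. I'm assuming the three relations come from 224 ≈ 225, 2400 ≈ 2401 and 4374 ≈ 4375, with $\log 5$ replaced by $1 - \log 2$; the solver is a generic Gauss-Jordan routine, not anything special.

```python
from fractions import Fraction
from math import log10


def solve_linear_system(matrix, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    # Augmented matrix with exact rational entries.
    rows = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Swap a row with a nonzero pivot into place, then normalize it.
        pivot = next(i for i in range(col, n) if rows[i][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for i in range(n):
            if i != col:
                factor = rows[i][col]
                rows[i] = [v - factor * p for v, p in zip(rows[i], rows[col])]
    return [row[n] for row in rows]


# Unknowns are (log 2, log 3, log 7), base 10. Assumed relations:
coefficients = [
    [7, -2, 1],   # 224 = 2^5 * 7        ~ 225 = 3^2 * 5^2
    [3, 1, -4],   # 2400 = 2^5 * 3 * 5^2 ~ 2401 = 7^4
    [5, 7, -1],   # 4374 = 2 * 3^7       ~ 4375 = 5^4 * 7
]
constants = [2, -2, 4]
log2, log3, log7 = solve_linear_system(coefficients, constants)

for prime, estimate in ((2, log2), (3, log3), (7, log7)):
    print(prime, estimate, round(float(estimate), 5), round(log10(prime), 5))
```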

In practice, who wants to have a rational with 239 in the denominator as an approximation? More practical is the following:

- remember $2^{10} \approx 10^3$ (from familiar powers of two), or, taking 120th roots, $2^{1/12} \approx 10^{1/40}$.
- remember $3/2 \approx 2^{7/12}$ (a fundamental fact of musical temperament).

That gives right away $\log 2 \approx 12/40 = 0.3$, and then $3 \approx 2^{19/12}$ so $\log 3 \approx 19/40 = 0.475$. This is essentially I. J. Good’s singing logarithms, which exploits the fact that a lot of rational numbers with small numerator and denominator can be approximated as powers of $2^{1/12}$, the fact on which our usual musical tuning system is based: $3/2 \approx 2^{7/12}$, $5/4 \approx 2^{4/12}$, and a bit less accurately $7/4 \approx 2^{10/12}$. (Musically these are the facts that the perfect fifth, major third, and minor seventh are the ratios 3/2, 5/4, and 7/4.)

From the last we can derive $7 \approx 2^{34/12}$, which gives $\log 7 \approx 34/40 = 0.85$. This is a less good approximation than those of $\log 2$ and $\log 3$, just as the seventh harmonic isn’t really a note in the equal-tempered scale – the harmonic or “barbershop” seventh is noticeably flat compared to the minor seventh.
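A quick numerical comparison, as a sketch. I'm assuming the semitone counts are 12 for 2 (an octave), 19 for 3 (an octave plus a fifth) and 34 for 7 (two octaves plus a minor seventh), with 40 semitones standing in for a factor of ten.

```python
from math import log10

# Assumed semitone counts (steps of 2^(1/12)) for each number.
SEMITONES = {2: 12, 3: 19, 7: 34}
SEMITONES_PER_DECADE = 40  # from 2^10 ~ 10^3, i.e. 10 ~ 2^(40/12)

errors = {}
for number, steps in SEMITONES.items():
    estimate = steps / SEMITONES_PER_DECADE
    errors[number] = estimate - log10(number)
    print(number, estimate, round(log10(number), 5), round(errors[number], 5))
```

The error for 7 comes out a few times larger than the errors for 2 and 3, and with the opposite sign.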

You can see that the designer was outlining the vertex of an icosahedron with that five-pointed star shape, but they couldn’t quite commit – the star points don’t actually point to the next vertex! (And no, they don’t turn.) You can get a better sense of the stars corresponding to the vertices of an icosahedron here:

There’s also this one, which is softer and has a couple of distinguished vertices antipodal to each other:

And this one which I bought for myself years ago, presumably in some sort of store that sold housewares. (Remember stores?)

The nice thing about this one is that you can see through it, which makes for some interesting photographic possibilities, such as this view where two antipodal vertices are aligned:

and this view, which emphasizes a threefold rotational symmetry:

Of course we have a soccer ball somewhere. You know what a soccer ball looks like, I’m not taking a picture.

Finally, the Twitter feed for this blog has as its icon an icosahedron, which I got from Wikimedia Commons:

I also use this as an avatar in various work systems that need one – these generally require small pictures and a face wouldn’t show up well, and it’s easier to pick out than the default in a lot of these systems which is just someone’s initials in a circle.

I do not yet have an icosahedron as a tattoo, but I’ve liked it for a while. If I were to get a tattoo it would be either an icosahedron or the diagram from Byrne’s rendition of Euclid’s proof of the Pythagorean theorem. (The link goes to Nicolas Rougeux’s interactive enhancement of the same.)

But then I had to check that claim. Peter Norvig has some word lists, meant to accompany a chapter on NLP, of which I’ve used the `enable1.txt` list before for word puzzles. (I’m not sure who compiled this list.) We can put words into a canonical form by alphabetizing the letters – for example `michael` becomes `acehilm`, and `stop` becomes `opst`. Scrabble players call this an alphagram. Then finding the four-letter word with the most anagrams is just a matter of counting.

```
library(tidyverse)
words = read_csv(url('https://norvig.com/ngrams/enable1.txt'), col_names = FALSE)
colnames(words) = 'word'
alphabetize_word = function(w){paste(sort(strsplit(w, '')[[1]]), collapse = '')}
words$alphagram = sapply(words$word, alphabetize_word)
words$len = nchar(words$alphagram)
alphagram_counts = words %>% group_by(alphagram, len) %>%
  summarize(n = n(), anagrams = paste0(word, collapse = ', '))
alphagram_counts %>% filter(len == 4) %>% arrange(desc(n))
```
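For readers who don't use R, the same alphagram grouping can be sketched in a few lines of Python. The word list here is a tiny made-up sample, not the real `enable1.txt`, and the function names are mine.

```python
from collections import defaultdict


def alphagram(word):
    """Canonical form of a word: its letters in alphabetical order."""
    return "".join(sorted(word))


def group_anagrams(words):
    """Map each alphagram to the list of words that share it."""
    groups = defaultdict(list)
    for word in words:
        groups[alphagram(word)].append(word)
    return groups


# A tiny illustrative sample, not the real word list.
sample = ["stop", "post", "pots", "spot", "tops", "opts", "east", "seat", "teas"]
groups = group_anagrams(sample)

# Largest anagram sets first.
for key, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(key, len(members), ", ".join(members))
```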

And here are the four-letter words with the most anagrams:

```
alphagram_counts %>% filter(len == 4) %>% select(-len) %>% arrange(desc(n))
```

```
# A tibble: 2,655 x 3
alphagram n anagrams
<chr> <int> <chr>
1 aest 8 ates, east, eats, etas, sate, seat, seta, teas
2 aers 7 ares, arse, ears, eras, rase, sear, sera
3 ailr 7 aril, lair, lari, liar, lira, rail, rial
4 astw 6 staw, swat, taws, twas, wast, wats
5 opst 6 opts, post, pots, spot, stop, tops
6 ostw 6 stow, swot, tows, twos, wost, wots
7 aeht 5 eath, haet, hate, heat, thae
8 aels 5 ales, lase, leas, sale, seal
9 aelt 5 late, tael, tale, teal, tela
10 aelv 5 lave, leva, vale, veal, vela
# … with 2,645 more rows
```

No! Child me was wrong!

But wait! What is “seta”? Is “ates” really a thing – you can’t pluralize a verb like that! (“ate” appears to be Tagalog for “older sister”.) Perhaps the aers set, with seven anagrams, wins, but “sera” is technical (plural of serum), and as an American I have trouble recognizing “rase” as a legitimate spelling of “raze”. “lari” is a unit of money in Georgia (Tbilisi, not Atlanta) which I was unfamiliar with. And so on.

Fortunately Norvig also has a list of word frequencies (`count_1w.txt`), of the 332,202 most common words in a trillion-word corpus. (One of the perks of working at Google, I assume.) So we can read that in.

```
freqs = read_delim(url('https://norvig.com/ngrams/count_1w.txt'),delim = '\t', col_names = FALSE)
colnames(freqs) = c('word', 'freq')
```

The most common words are the ones you’d expect. (2.3% of words are “the”.)

```
> head(freqs)
# A tibble: 6 x 2
word freq
<chr> <dbl>
1 the 23135851162
2 of 13151942776
3 and 12997637966
4 to 12136980858
5 a 9081174698
6 in 8469404971
```

And the least common words are… barely words. (I don’t know the full story behind this dataset.) So it seems reasonable that all “real” words will be here.

```
> tail(freqs)
# A tibble: 6 x 2
word freq
<chr> <dbl>
1 goofel 12711
2 gooek 12711
3 gooddg 12711
4 gooblle 12711
5 gollgo 12711
6 golgw 12711
```

Now we can attach frequencies to the words. There are too many words in the sets for a table to be nice, so we switch to plots.

```
words %>% left_join(alphagram_counts) %>%
  filter(len == 4 & n >= 6) %>%
  left_join(freqs) %>% arrange(alphagram, desc(freq)) %>%
  select(alphagram, word, freq) %>% group_by(alphagram) %>%
  mutate(rk = rank(desc(freq))) %>%
  ggplot() + geom_line(aes(x=rk, y=log(freq/10^12, 10), group = alphagram, color = alphagram)) +
  scale_x_continuous('rank within alphagram set', breaks = 1:8, minor_breaks = c()) +
  scale_y_continuous('log_10 of word frequency', breaks = -8:-3, minor_breaks = c()) +
  theme_minimal() + geom_text(aes(x=rk, y=log(freq/10^12, 10), color = alphagram, label = word)) +
  ggtitle('Frequency of four-letter words with six or more anagrams')
```

And if we plot the frequency of each word against its rank **in its own anagram set**…

then we can see that the STOP set consists of much more common words than any of the others. (STOP isn’t even the most common of its own anagrams, which surprises me – that honor goes to POST. But when I was a small child STOP seemed much more common, because of the signs.) I’m surprised to see SERA so high; this is either an extremely technical corpus or (more likely) contamination from Spanish.

And here’s a similar plot for five letters. Here I’d thought the word with the most anagrams was LEAST (among “common” words, 6: TALES, STEAL, SLATE, TESLA, STALE) but it looks like SPARE wins with room to spare, even if you don’t buy that APRES is an English word.

Sure, you could get out a map and count them. Or you could estimate.

There are 48 contiguous states. The average state has six borders [citation needed], so that’s 288 borders, but we double-counted, so that’s 144. But we need to apply a bit of a haircut for those states that are around the edge. How many of those are there? Figure the US is roughly a 5-by-10 rectangle of states, so there are 30 states around the edge. 144 minus 30 is 114.

There are actually 109. In 1998 Thomas Holmes constructed a data set of those borders for a paper, The Effect of State Policies on the Location of Industry: Evidence from State Borders. I haven’t read the paper. It appears that it shows that there was more manufacturing activity on the “pro-business” (anti-union, has so-called right-to-work laws) side of a state border than on the “anti-business” (pro-union, doesn’t have so-called right-to-work laws) side of the state border.

This method could probably also be applied with, say, mask mandates and COVID case rates. Early on in the pandemic there was some coverage of how Tennessee was doing much worse than Kentucky, although that may have been overly politicized (Kentucky has a Democratic governor, Tennessee a Republican) and may have been due to higher testing rates in Tennessee. (See Andrew Gelman’s post on the topic; it appears that data on deaths didn’t show the same gap.)

Some people like counting the borders they’ve crossed, as in this post at Twelve Mile Circle. That post includes a map by Jon Persky that gives 138 borders, but that includes 16 land crossings between contiguous US states and Canadian provinces; 8 between US states and Mexican states; two between Alaska and Canadian provinces; and three borders that can only be crossed by water (Maine – Nova Scotia, New York – Rhode Island, and Ohio – Ontario).

As for the fact that “the average state has six borders”, this is really a statement about planar graphs. From the map of the US, construct a planar graph by taking the 48 states as vertices and the state borders as edges. (You have a problem at Four Corners, which we’ll ignore.) Let V be the number of vertices in the graph, E its number of edges, and F its number of faces. Here a “face” corresponds to a place where three states meet, such as Pennsylvania-Maryland-Delaware or Georgia-Alabama-Tennessee. Then every edge meets two faces and, except for around the perimeter of the graph, every face has three edges, and thus $3F \approx 2E$. Euler proved $V - E + F = 2$, which we’ll approximate as $V - E + F \approx 0$. Thus $V - E + \frac{2}{3}E \approx 0$, or rearranging $E \approx 3V$; the number of edges (state borders) is about three times the number of states. Since each border is shared by two states, that works out to about six borders per state.
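To put numbers on this, here is a small sketch. It uses the exact bound for simple planar graphs, $E \le 3V - 6$ (which holds for $V \ge 3$), rather than the rough $E \approx 3V$, and compares it with the count of 109 borders quoted earlier; the function name is mine.

```python
def average_degree(vertices, edges):
    """Average number of borders per state: each edge touches two states."""
    return 2 * edges / vertices


states = 48
max_edges = 3 * states - 6  # upper bound on edges in a simple planar graph

print(max_edges, average_degree(states, max_edges))  # 138 5.75
print(round(average_degree(states, 109), 2))         # using the 109 actual borders
```

So "six borders" is really a ceiling; with 109 borders the average contiguous state has about four and a half neighbors.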

This is all brought to you by Colin Beveridge’s kids asking the same question about national borders.

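| Column 1 | Column 2 | Column 3 | Row product |
|---|---|---|---|
| | | | 294 |
| | | | 216 |
| | | | 135 |
| | | | 98 |
| | | | 112 |
| | | | 84 |
| | | | 245 |
| | | | 40 |
| 8890560 | 156800 | 55566 | |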

We start by factorizing the column products, to get $2^6 \cdot 3^4 \cdot 5 \cdot 7^3$, $2^7 \cdot 5^2 \cdot 7^2$, and $2 \cdot 3^4 \cdot 7^3$ respectively. Since the second-column product isn’t divisible by 3, the second column must consist of only 1, 2, 4, 5, 7, and 8. The third column isn’t divisible by 4 or 5 so it can’t contain 4, 5, or 8; furthermore it contains only a single even number (2 or 6).
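The factorizations themselves are quick to check by trial division. A minimal sketch, with `factorize` being a helper name of my own:

```python
def factorize(m):
    """Prime factorization of m >= 2 as a {prime: exponent} dict,
    found by trial division."""
    factors = {}
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors[d] = factors.get(d, 0) + 1
            m //= d
        d += 1
    if m > 1:
        # Whatever is left over is itself prime.
        factors[m] = factors.get(m, 0) + 1
    return factors


for column_product in (8890560, 156800, 55566):
    print(column_product, factorize(column_product))
```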

We can explicitly enumerate the possibilities for each row. For example for the row with product 84 we have

```
library(dplyr)
expand.grid(1:9, 1:9, 1:9) %>%
  filter(8890560 %% Var1 == 0 & 156800 %% Var2 == 0 & 55566 %% Var3 == 0) %>%
  mutate(prod = Var1 * Var2 * Var3) %>% filter(prod == 84)
```

which returns the data frame

```
Var1 Var2 Var3 prod
1 6 7 2 84
2 7 4 3 84
3 4 7 3 84
4 7 2 6 84
5 2 7 6 84
6 6 2 7 84
7 3 4 7 84
```

and so there are seven possibilities for this row. 7, 6, 2 doesn’t appear because the second column can’t contain a multiple of three.

This is actually enough to fill in a few entries, and we also can list all the possibilities for the remaining ones:

| Column 1 | Column 2 | Column 3 | Row product |
|---|---|---|---|
| 6, 7 | 7 | 6, 7 | 294 |
| 3, 6, 9 | 4, 8 | 3, 6, 9 | 216 |
| 3, 9 | 5 | 3, 9 | 135 |
| 2, 7 | 2, 7 | 2, 7 | 98 |
| 2, 4, 7, 8 | 2, 4, 7, 8 | 2, 7 | 112 |
| 2, 3, 4, 6, 7 | 2, 4, 7 | 2, 3, 6, 7 | 84 |
| 5, 7 | 5, 7 | 7 | 245 |
| 4, 5, 8 | 4, 5, 8 | 1, 2 | 40 |
| 8890560 | 156800 | 55566 | |

Now the first column has product $2^6 \cdot 3^4 \cdot 5 \cdot 7^3$, so it must have a single 5 and three 7s. The remaining four entries have to multiply to $2^6 \cdot 3^4 = 5184$, so they must be two 9s and two 8s. That lets us complete the first column, because there are only two possible locations for a 9, two for an 8, and one for a 5. And knowing those first-column values allows us to complete various rows:

| Column 1 | Column 2 | Column 3 | Row product |
|---|---|---|---|
| 7 | 7 | 6 | 294 |
| 9 | 4, 8 | 3, 6 | 216 |
| 9 | 5 | 3 | 135 |
| 7 | 2, 7 | 2, 7 | 98 |
| 8 | 2, 7 | 2, 7 | 112 |
| 7 | 2, 4 | 3, 6 | 84 |
| 5 | 7 | 7 | 245 |
| 8 | 5 | 1 | 40 |
| 8890560 | 156800 | 55566 | |

The third column only contains a single even number, which is enough to finish that column, and then work out the second column by arithmetic:

| Column 1 | Column 2 | Column 3 | Row product |
|---|---|---|---|
| 7 | 7 | 6 | 294 |
| 9 | 8 | 3 | 216 |
| 9 | 5 | 3 | 135 |
| 7 | 2 | 7 | 98 |
| 8 | 2 | 7 | 112 |
| 7 | 4 | 3 | 84 |
| 5 | 7 | 7 | 245 |
| 8 | 5 | 1 | 40 |
| 8890560 | 156800 | 55566 | |

I was honestly surprised this puzzle was solvable – I didn’t believe there was enough information at first. I think it works out because the first-column product 8890560 is large enough that we can determine uniquely what the values in the column are and only have to put them in order; the third column having only one even value helps as well.
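If you'd rather let a computer confirm that the answer is unique, a small depth-first search does it. This is a sketch under the same assumption as the R snippet (every entry is a digit from 1 to 9); the function names are mine.

```python
from itertools import product

ROW_PRODUCTS = [294, 216, 135, 98, 112, 84, 245, 40]
COLUMN_PRODUCTS = [8890560, 156800, 55566]


def row_candidates(row_product):
    """All triples of digits 1-9 with the given product whose entries
    each divide the corresponding column product."""
    return [
        triple
        for triple in product(range(1, 10), repeat=3)
        if triple[0] * triple[1] * triple[2] == row_product
        and all(c % d == 0 for c, d in zip(COLUMN_PRODUCTS, triple))
    ]


def solve(row_index=0, remaining=tuple(COLUMN_PRODUCTS)):
    """Depth-first search over rows; `remaining` holds what is left of each
    column product after dividing out the rows placed so far."""
    if row_index == len(ROW_PRODUCTS):
        # A full grid is valid only if every column product is used up.
        return [[]] if all(r == 1 for r in remaining) else []
    solutions = []
    for triple in row_candidates(ROW_PRODUCTS[row_index]):
        if all(r % d == 0 for r, d in zip(remaining, triple)):
            rest = tuple(r // d for r, d in zip(remaining, triple))
            for tail in solve(row_index + 1, rest):
                solutions.append([triple] + tail)
    return solutions


solutions = solve()
print(len(solutions))
for row in solutions[0]:
    print(row)
```

Dividing the remaining column products as we go prunes the search early, so this runs essentially instantly.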

Also, I believe this puzzle was part of the MIT Mystery Hunt that took place this weekend (which I haven’t competed in in a Very Long Time). The Riddler column was named “Can You Hunt For The Mysterious Numbers?” and it says the puzzle was by Barbara Yew, and googling that “name” finds an MIT web page at yewlabs.mit.edu for something called “MYST2021: Maturing Young Scientific Theories: Expanding Reality & You” – the first letters spell “MYSTERY”.

A friend of my wife’s pointed this out because Georgia jumps out at the eye on this map. My wife similarly noticed that Los Angeles did *not* look bad, recent reports of crisis there notwithstanding.

The reason is simple – Georgia has an unusually large number of counties for its size. We have 159 counties in 57,513 square miles, for an average area of 362 square miles. Compare Florida (67 counties, 53,625 square miles, average of 800 square miles per county) or Alabama (67 counties, 50,645 square miles, 756 square miles per county). There’s a belt of states stretching roughly north-northwest from Georgia on this map – Tennessee, Kentucky, Indiana, Ohio – that jumps out, and these are all among the states with the smallest average county size. Each county is represented by a circle with size proportional to its coronavirus case rate, so a state’s intensity of color is roughly (coronavirus case rate per capita) ÷ (average county area). And average county area is larger in some parts of the country than others, for historical reasons nicely expounded by Ed Stephan.
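The arithmetic behind those averages, as a small sketch. The county counts and areas are the ones quoted above; "circles per 1,000 square miles" is my own way of expressing how densely the map gets covered.

```python
# (number of counties, area in square miles), as quoted in the text
STATES = {
    "Georgia": (159, 57513),
    "Florida": (67, 53625),
    "Alabama": (67, 50645),
}


def average_county_area(counties, area):
    """Average county area in square miles."""
    return area / counties


def circles_per_thousand_square_miles(counties, area):
    """One circle per county, so this measures how crowded the map looks."""
    return 1000 * counties / area


for state, (counties, area) in STATES.items():
    print(
        state,
        round(average_county_area(counties, area)),
        round(circles_per_thousand_square_miles(counties, area), 2),
    )
```

At the same case rate, Georgia gets more than twice as many circles per unit area as Florida or Alabama.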

The Post also has the same map with circle sizes proportional to the total number of cases per county. This, I think, looks much more like expected:

This fixes the issue in Los Angeles – it now has a very large circle, because Los Angeles County has ten million people. However this has the opposite problem – now low-population areas look relatively safe. For overall case rates I prefer a color-shaded map. The Post doesn’t have one, but the New York Times does. Darker/redder colors indicate higher case rates. (I’m old enough to remember when the color scale just went up to red, but not prescient enough to have captured those screen shots on a regular basis.)

The Times also has a map with a circle for each county. This map isn’t directly comparable to The Washington Post map because it’s a map of the total number of cases; it uses larger circles, which has the effect of not making low-population-density states look totally unscathed by the coronavirus but at the cost of overlapping in high-population-density areas like coastal California or the Northeast corridor.

At least to me, it seems natural to interpret the size of a circle as a raw number, and the color as a rate. Either one of the circle maps can be thought of as putting a little bit of red down near each case (recent cases in the WaPo map, all cases in the NYT). And so, if you squint, these maps don’t look all that different from population density maps, which Kieran Healy has called one of America’s ur-choropleths, because the variations in coronavirus rates are swamped by the variations in population density.

As a reminder, on day k of Christmas (k = 1, 2, …, 12) the singer receives 1 of gift 1, 2 of gift 2, …, k of gift k. Christmas has 12 days. (Gift 1 is “a partridge in a pear tree”, gift 2 is “turtle doves”, and so on up to gift 12 which is “drummers drumming”, but this is irrelevant.)

So on day k there are a total of $1 + 2 + \cdots + k$ gifts; this is $\binom{k+1}{2}$. The total number of gifts received is therefore

$$\sum_{k=1}^{12} \binom{k+1}{2}$$

and by the hockey-stick identity (sometimes also called the Christmas stocking identity) this is $\binom{14}{3} = 364$. The identity can be proven by induction, but I prefer a combinatorial proof. Consider the subsets of $[14] = \{1, 2, \ldots, 14\}$ of size 3 and group them according to their largest element. Then there are $\binom{k+1}{2}$ sets whose largest element is $k+2$, for each of $k = 1, 2, \ldots, 12$.

This suggests another identity – what if we group according to the middle element of the subset instead? For example, there are 5 × 8 = 40 3-subsets of [14] whose middle element is 6; each one has one element chosen from 1, 2, …, 5 and one element chosen from 7, 8, …, 14. More generally there are k(13-k) 3-subsets of [14] with middle element k+1. Thus we have

$$\sum_{k=1}^{12} k(13-k) = \binom{14}{3}.$$

In terms of the song, this is actually a natural way to count. $k(13-k)$ is the number of gifts of type k, since k such gifts get given on each of the last $13-k$ days – there are 12 total partridges in pear trees, 2 × 11 = 22 total turtle doves, 3 × 10 = 30 total calling birds, and so on until we get back down to 12 drummers drumming. (The most frequent gifts? 42 swans and 42 geese. Maybe that was the question.)
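Both ways of counting are easy to check numerically. A minimal sketch, with variable names of my own choosing:

```python
from math import comb

DAYS = 12

# Count by day: day k brings 1 + 2 + ... + k = C(k+1, 2) gifts.
by_day = sum(comb(k + 1, 2) for k in range(1, DAYS + 1))

# Count by gift type: gift k arrives k at a time on each of the last 13 - k days.
per_gift = {k: k * (DAYS + 1 - k) for k in range(1, DAYS + 1)}
by_gift = sum(per_gift.values())

print(by_day, by_gift, comb(DAYS + 2, 3))  # 364 364 364
print(max(per_gift.values()))              # 42
```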

Also the presidential election in Georgia in 2020 was very close, as you may have heard: 49.47% for Biden, 49.24% for Trump. (The law does not allow for runoffs in presidential elections.)

But here’s the surprising thing. Below are two maps of the state of Georgia. Can you spot the difference?

There is *one* county with a different winner in the two maps – Burke County, in the east-central part of the state. And Burke was won by Clinton in 2016 (left map), but by Trump in 2020 (right map). The only county that switched winners switched in the *opposite* direction of the state as a whole.

(2020 results from the state’s official election results site, map by me. 2016 results from opendatasoft, map by me.)

Furthermore, what if Georgia had an electoral college made up of counties?

Historically this is not entirely crazy; Georgia used to have something called the County Unit System for statewide primaries. In this system the largest eight counties were classified as “Urban”, the next-largest 30 counties were classified as “Town”, and the remaining 121 counties were classified as “Rural”; urban, town, and rural counties got 6, 4, and 2 votes respectively, awarded on a winner-take-all basis. This benefited rural candidates.

With current population statistics, the “urban” counties would be Fulton, Gwinnett, Cobb, DeKalb, Clayton, Chatham, Cherokee, and Forsyth. (These are the county containing most of Atlanta, six suburban Atlanta counties, and the county containing Savannah; this category would better be called “suburban”.) Trump would have won 308 of the 410 county unit votes in 2020 – he won 2 of the 8 “urban” counties, 21 of the 30 “town” counties, and 106 of the 121 “rural” counties. In 2016 he would have won 310 of 410.
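The unit-vote tally is just a weighted sum, sketched here with the 2020 numbers quoted above. The dictionary and function names are mine.

```python
UNIT_VOTES = {"urban": 6, "town": 4, "rural": 2}
COUNTIES = {"urban": 8, "town": 30, "rural": 121}
TRUMP_2020_WINS = {"urban": 2, "town": 21, "rural": 106}


def unit_votes(counties_won):
    """Total county unit votes for a given number of counties per class."""
    return sum(UNIT_VOTES[kind] * count for kind, count in counties_won.items())


print(unit_votes(TRUMP_2020_WINS), unit_votes(COUNTIES))  # 308 410
```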

Unsurprisingly, a system this biased was found unconstitutional. But what if we had an electoral college? We can start with a simple calculation:

- in 2020, Biden won 30 of 159 counties, making a total of 53.69% of the state’s population (5,643,569 in Biden counties, 4,867,562 in Trump counties)
- in 2016, Clinton won 31 of 159 counties, making a total of 53.90% of the state’s population (5,666,008 to 4,845,123) – the difference being Burke, mentioned above.

(Populations are census estimates from the Governor’s Office of Planning and Budget; I used 2018 estimates.)

This surprised me, but upon reflection, it makes sense – the red counties in Georgia are *really* red. But we all know a real electoral college gives smaller units undue influence. We can simulate that by adjusting the population of each county. In 2020 Biden won counties with 776,007 more people, but he won 99 fewer counties. So if we give each county eight thousand more “people” – analogous to the electoral votes that correspond to Senators – then Trump wins this state-level electoral college. Also in this world he’d be calling county commissioners instead of the Secretary of State.
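Here is a sketch of that adjustment, using the 2020 population totals and county counts quoted above (30 counties for Biden, so 129 for Trump). The bonus is a flat number of extra "people" per county; the names are my own.

```python
biden_population, trump_population = 5_643_569, 4_867_562
biden_counties, trump_counties = 30, 129


def weighted_margin(bonus):
    """Biden's margin when every county gets `bonus` extra 'people'."""
    biden_total = biden_population + bonus * biden_counties
    trump_total = trump_population + bonus * trump_counties
    return biden_total - trump_total


# The bonus at which the two sides would tie.
break_even = (biden_population - trump_population) / (trump_counties - biden_counties)

print(round(break_even, 1))
print(weighted_margin(8000))
```

The tie point is a little under eight thousand, so eight thousand extra "people" per county is just enough to tip it.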

But in any case we would not be talking about Georgia “flipping” from Republicans to Democrats between 2016 and 2020. The actual flipping was caused, mostly, by Atlanta suburbs moving to the left – but they happened to do so in a way where no counties crossed over. (The flipping of Gwinnett and Cobb, the two largest purely suburban counties, already happened between 2012 and 2016.) I haven’t explored this in depth, but it’s interesting to think about how an electoral college of counties distorts state-level results as a proxy for how an electoral college of states distorts national results.
