I had thought, for a few decades now, that STOP was the four-letter word with the most anagrams, with six: STOP itself, POST, POTS, TOPS, OPTS, SPOT. So of course when Josh Millard put out these STOP permutations signs, I had to buy one. It’s a limited edition of 24 stop sign prints, one for each permutation. (I opted for OPTS. As of this writing there are 18 still available; the 6 that have been bought are the five anagrams of STOP other than STOP itself, and SOTP.)
But then I had to check that claim. Peter Norvig has posted some word lists, meant to accompany a chapter on NLP, and I’ve used the enable1.txt list from that collection before for word puzzles. (I’m not sure who compiled this list.) We can put words into a canonical form by alphabetizing the letters – for example michael becomes acehilm, and stop becomes opst. Scrabble players call this an alphagram. Then finding the four-letter word with the most anagrams is just a matter of counting.
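To make the alphagram idea concrete before running it on the full list, here is a minimal base-R sketch on a hand-picked toy vocabulary. The names `alphagram`, `toy_words`, `keys`, `groups` and `counts` are just illustrative, and the toy vector is not taken from enable1.txt.

```r
# Alphagram: the letters of a word sorted alphabetically.
alphagram = function(w) paste(sort(strsplit(w, '')[[1]]), collapse = '')

# Hand-picked toy vocabulary (illustrative only, not the real word list).
toy_words = c('stop', 'post', 'pots', 'tops', 'opts', 'spot',
              'east', 'seat', 'michael')

# Words are anagrams of each other exactly when their alphagrams match,
# so grouping by alphagram collects each anagram set.
keys = vapply(toy_words, alphagram, character(1))
groups = split(toy_words, keys)

# Size of each anagram set, largest first.
counts = sort(lengths(groups), decreasing = TRUE)

print(groups[['opst']])
print(counts)
```

On this toy input the `opst` group contains all six STOP anagrams; the real question is what happens when the vocabulary is a full dictionary.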
library(tidyverse)
# Read the word list: one word per line, no header row.
words = read_csv(url('https://norvig.com/ngrams/enable1.txt'), col_names = FALSE)
colnames(words) = 'word'
# Alphagram: the letters of a word in alphabetical order.
alphabetize_word = function(w){paste(sort(strsplit(w, '')[[1]]), collapse = '')}
words$alphagram = sapply(words$word, alphabetize_word)
words$len = nchar(words$alphagram)
# One row per alphagram: how many words share it, and which ones.
alphagram_counts = words %>% group_by(alphagram, len) %>%
  summarize(n = n(), anagrams = paste0(word, collapse = ', ')) %>%
  ungroup()
alphagram_counts %>% filter(len == 4) %>% arrange(desc(n))
And here are the four-letter words with the most anagrams:
alphagram_counts %>% filter(len == 4) %>% select(-len) %>% arrange(desc(n))
# A tibble: 2,655 x 3
alphagram n anagrams
<chr> <int> <chr>
1 aest 8 ates, east, eats, etas, sate, seat, seta, teas
2 aers 7 ares, arse, ears, eras, rase, sear, sera
3 ailr 7 aril, lair, lari, liar, lira, rail, rial
4 astw 6 staw, swat, taws, twas, wast, wats
5 opst 6 opts, post, pots, spot, stop, tops
6 ostw 6 stow, swot, tows, twos, wost, wots
7 aeht 5 eath, haet, hate, heat, thae
8 aels 5 ales, lase, leas, sale, seal
9 aelt 5 late, tael, tale, teal, tela
10 aelv 5 lave, leva, vale, veal, vela
# … with 2,645 more rows
No! Child me was wrong!
But wait! What is “seta”? Is “ates” really a thing – you can’t pluralize a verb like that! (“ate” appears to be Tagalog for “older sister”.) Perhaps the aers set, with seven anagrams, wins, but “sera” is technical (plural of serum), and as an American I have trouble recognizing “rase” as a legitimate spelling of “raze”. “lari” is a unit of money in Georgia (Tbilisi, not Atlanta) which I was unfamiliar with. And so on.
Fortunately Norvig also has a list of word frequencies (count_1w.txt), of the 332,202 most common words in a trillion-word corpus. (One of the perks of working at Google, I assume.) So we can read that in.
freqs = read_delim(url('https://norvig.com/ngrams/count_1w.txt'), delim = '\t', col_names = FALSE)
colnames(freqs) = c('word', 'freq')
The most common words are the ones you’d expect. (2.3% of words are “the”.)
> head(freqs)
# A tibble: 6 x 2
word freq
<chr> <dbl>
1 the 23135851162
2 of 13151942776
3 and 12997637966
4 to 12136980858
5 a 9081174698
6 in 8469404971
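The 2.3% figure is just the count for “the” divided by the size of the corpus. A quick sketch of that arithmetic, assuming the corpus size is taken as exactly 10^12 tokens (the “trillion-word” round number, which is also the divisor used for the plot further down); `the_count`, `corpus_size` and `the_share` are names introduced here for illustration.

```r
# Count for "the", copied from the head(freqs) output.
the_count = 23135851162

# Assumption: treat the corpus as exactly one trillion tokens.
corpus_size = 10^12

# Relative frequency of "the" in the corpus.
the_share = the_count / corpus_size
cat(sprintf('%.1f%%\n', 100 * the_share))  # prints 2.3%

# The same quantity on a log10 scale, which is easier to compare
# across words whose frequencies differ by orders of magnitude.
print(log10(the_share))
```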
And the least common words are… barely words. (I don’t know the full story behind this dataset.) So it seems reasonable that all “real” words will be here.
> tail(freqs)
# A tibble: 6 x 2
word freq
<chr> <dbl>
1 goofel 12711
2 gooek 12711
3 gooddg 12711
4 gooblle 12711
5 gollgo 12711
6 golgw 12711
Now we can attach frequencies to the words. There are too many words in the sets for a table to be nice, so we switch to plots.
words %>% left_join(alphagram_counts) %>%
filter(len == 4 & n >= 6) %>%
left_join(freqs) %>% arrange(alphagram, desc(freq)) %>%
select(alphagram, word, freq) %>% group_by(alphagram) %>%
mutate(rk = rank(desc(freq))) %>%
ggplot() + geom_line(aes(x=rk, y=log(freq/10^12, 10), group = alphagram, color = alphagram)) +
scale_x_continuous('rank within alphagram set', breaks = 1:8, minor_breaks = c()) +
scale_y_continuous('log_10 of word frequency', breaks = -8:-3, minor_breaks = c()) +
theme_minimal() + geom_text(aes(x=rk, y=log(freq/10^12, 10), color = alphagram, label = word)) +
ggtitle('Frequency of four-letter words with six or more anagrams')
And if we plot the frequency of each word against its rank in its own anagram set…

then we can see that the STOP set consists of much more common words than any of the others. (STOP isn’t even the most common of its own anagrams, which surprises me – that honor goes to POST. But when I was a small child STOP seemed much more common, because of the signs.) I’m surprised to see SERA so high; this is either an extremely technical corpus or (more likely) contamination from Spanish.
And here’s a similar plot for five letters. Here I’d thought the word with the most anagrams was LEAST (six among “common” words: LEAST itself plus TALES, STEAL, SLATE, TESLA, STALE) but it looks like SPARE wins with room to spare, even if you don’t buy that APRES is an English word.

Recently I found that taking the intersection of a Scrabble dictionary (carefully curated, includes unrecognisably obscure words) and Norvig’s 1/3 million word collection (uncurated, includes gibberish non-words) gave me a dictionary of about 90,000 words that feels “fair”. It has words that, even if you don’t know them, you feel you could plausibly have known.
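The intersection idea can be sketched in a couple of lines of base R. The two vectors below are tiny made-up stand-ins, not the real lists, and which words appear in which vector is chosen purely for illustration; `curated_like`, `frequency_like` and `fair` are hypothetical names.

```r
# Stand-in for a curated dictionary (toy data, membership is made up).
curated_like = c('opts', 'post', 'rase', 'stop')

# Stand-in for the words of an uncurated frequency list (toy data).
frequency_like = c('goofel', 'golgw', 'post', 'stop', 'opts')

# Keep only the words that both sources agree on.
fair = sort(intersect(curated_like, frequency_like))
print(fair)
```

With the real data, the same `intersect` call would be applied to the `word` columns of the two tables.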