STOP POTS POST. SPOT OPTS TOPS.

I had thought, for a few decades now, that STOP was the four-letter word with the most anagrams, with six: STOP itself, POST, POTS, TOPS, OPTS, SPOT. So of course when Josh Millard put out these STOP permutations signs, I had to buy one. It’s a limited edition of 24 stop sign prints, one for each permutation. (I opted for OPTS. As of this writing there are 18 still available; the 6 that have been bought are the five anagrams of STOP other than STOP itself, and SOTP.)

But then I had to check that claim. Peter Norvig has posted some word lists, meant to accompany a chapter on NLP; I’ve used the enable1.txt list from that collection before for word puzzles. (I’m not sure who compiled this list.) We can put words into a canonical form by alphabetizing the letters: for example, michael becomes acehilm, and stop becomes opst. Scrabble players call this an alphagram. Then finding the four-letter word with the most anagrams is just a matter of counting how many words share each alphagram.

library(tidyverse)
# enable1.txt has one word per line, so it reads in as a single unnamed column
words = read_csv(url('https://norvig.com/ngrams/enable1.txt'), col_names = FALSE)
colnames(words) = 'word'
# alphagram: the letters of a word, sorted alphabetically
alphabetize_word = function(w){paste(sort(strsplit(w, '')[[1]]), collapse = '')}
words$alphagram = sapply(words$word, alphabetize_word)
words$len = nchar(words$alphagram)
# one row per alphagram: how many words share it, and which words they are
alphagram_counts = words %>% group_by(alphagram, len) %>% 
  summarize(n = n(), anagrams = paste0(word, collapse = ', '))
alphagram_counts %>% filter(len == 4) %>% arrange(desc(n))
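As a quick sanity check of the alphagram idea, here is a minimal sketch on a handful of words. It uses base R only, so it runs without the tidyverse or a network connection. The `toy_words` vector and the `toy_` names are assumptions made up for illustration; they are not drawn from enable1.txt.

```r
# Toy word list (made up for illustration, not read from enable1.txt)
toy_words = c('stop', 'post', 'pots', 'spot', 'opts', 'tops',
              'east', 'seat', 'teas', 'michael')

# Sort the letters of a single word to get its alphagram
alphabetize_word = function(w){paste(sort(strsplit(w, '')[[1]]), collapse = '')}

toy_alphagrams = sapply(toy_words, alphabetize_word, USE.NAMES = FALSE)

# Count how many words share each alphagram, largest sets first
toy_counts = sort(table(toy_alphagrams), decreasing = TRUE)
print(toy_counts)

# Collect the anagram sets themselves, keyed by alphagram
toy_sets = split(toy_words, toy_alphagrams)
print(toy_sets[['opst']])   # the six STOP anagrams from the toy list
```

Words that are anagrams of each other collapse to the same key, so counting anagrams reduces to counting repeated keys.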

And here are the four-letter words with the most anagrams:

alphagram_counts %>% filter(len == 4) %>% select(-len) %>% arrange(desc(n))

# A tibble: 2,655 x 3
   alphagram     n anagrams                                      
   <chr>     <int> <chr>                                         
 1 aest          8 ates, east, eats, etas, sate, seat, seta, teas
 2 aers          7 ares, arse, ears, eras, rase, sear, sera      
 3 ailr          7 aril, lair, lari, liar, lira, rail, rial      
 4 astw          6 staw, swat, taws, twas, wast, wats            
 5 opst          6 opts, post, pots, spot, stop, tops            
 6 ostw          6 stow, swot, tows, twos, wost, wots            
 7 aeht          5 eath, haet, hate, heat, thae                  
 8 aels          5 ales, lase, leas, sale, seal                  
 9 aelt          5 late, tael, tale, teal, tela                  
10 aelv          5 lave, leva, vale, veal, vela                  
# … with 2,645 more rows

No! Child me was wrong!

But wait! What is “seta”? Is “ates” really a thing – you can’t pluralize a verb like that! (“ate” appears to be Tagalog for “older sister”.) Perhaps the aers set, with seven anagrams, wins, but “sera” is technical (plural of serum), and as an American I have trouble recognizing “rase” as a legitimate spelling of “raze”. “lari” is a unit of money in Georgia (Tbilisi, not Atlanta) which I was unfamiliar with. And so on.

Fortunately Norvig also has a list of word frequencies (count_1w.txt), of the 332,202 most common words in a trillion-word corpus. (One of the perks of working at Google, I assume.) So we can read that in.

freqs = read_delim(url('https://norvig.com/ngrams/count_1w.txt'),delim = '\t', col_names = FALSE)
colnames(freqs) = c('word', 'freq')

The most common words are the ones you’d expect. (2.3% of words are “the”.)

> head(freqs)
# A tibble: 6 x 2
  word         freq
  <chr>       <dbl>
1 the   23135851162
2 of    13151942776
3 and   12997637966
4 to    12136980858
5 a      9081174698
6 in     8469404971
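The 2.3% figure can be reproduced from the first row of that table. This is a small sketch that assumes the corpus total is exactly one trillion words; the true sum of the counts may differ somewhat.

```r
the_count = 23135851162   # count for "the", copied from the table above
corpus_size = 10^12       # assumption: a round trillion words in the corpus

# Share of all word occurrences that are "the", as a percentage
the_share = round(100 * the_count / corpus_size, 1)
print(the_share)          # 2.3
```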

And the least common words are… barely words. (I don’t know the full story behind this dataset.) So it seems reasonable that all “real” words will be here.

> tail(freqs)
# A tibble: 6 x 2
  word     freq
  <chr>   <dbl>
1 goofel  12711
2 gooek   12711
3 gooddg  12711
4 gooblle 12711
5 gollgo  12711
6 golgw   12711
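Whether every dictionary word really shows up in the frequency list is something one could check rather than assume. Here is a minimal base-R sketch of that check. The two vectors are hypothetical stand-ins for `words$word` and `freqs$word`, with made-up contents, so the result says nothing about the real files.

```r
# Hypothetical stand-ins for words$word and freqs$word (contents are made up)
dictionary_words = c('stop', 'post', 'staw', 'wost')
frequency_words  = c('the', 'post', 'stop', 'goofel')

# Dictionary words with no entry in the frequency list
missing_words = setdiff(dictionary_words, frequency_words)
print(missing_words)

# Fraction of dictionary words that do have a frequency
coverage = mean(dictionary_words %in% frequency_words)
print(coverage)
```

With the real data, any word reported as missing would end up with an NA frequency after a left join.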

Now we can attach frequencies to the words. There are too many words in the sets for a table to be nice, so we switch to plots.

# attach each word's set size, keep four-letter sets with at least six anagrams,
# then attach corpus frequencies and rank the words within each set
words %>% left_join(alphagram_counts, by = c('alphagram', 'len')) %>%
  filter(len == 4 & n >= 6) %>% 
  left_join(freqs, by = 'word') %>% arrange(alphagram, desc(freq)) %>% 
  select(alphagram, word, freq) %>% group_by(alphagram) %>% 
  mutate(rk = rank(desc(freq))) %>% 
  # dividing by 10^12 treats the corpus as exactly a trillion words
  ggplot() + geom_line(aes(x = rk, y = log(freq/10^12, 10), group = alphagram, color = alphagram)) + 
  scale_x_continuous('rank within alphagram set', breaks = 1:8, minor_breaks = c()) + 
  scale_y_continuous('log_10 of word frequency', breaks = -8:-3, minor_breaks = c()) +
  theme_minimal() + geom_text(aes(x = rk, y = log(freq/10^12, 10), color = alphagram, label = word)) +
  ggtitle('Frequency of four-letter words with six or more anagrams')

And if we plot the frequency of each word against its rank in its own anagram set…

then we can see that the STOP set consists of much more common words than any of the others. (STOP isn’t even the most common of its own anagrams, which surprises me – that honor goes to POST. But when I was a small child STOP seemed much more common, because of the signs.) I’m surprised to see SERA so high; this is either an extremely technical corpus or (more likely) contamination from Spanish.

And here’s a similar plot for five letters. Here I’d thought the word with the most anagrams was LEAST (six among “common” words, counting LEAST itself along with TALES, STEAL, SLATE, TESLA, STALE), but it looks like SPARE wins with room to spare, even if you don’t buy that APRES is an English word.
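The five-letter count is the same computation with a different length filter. One way to sketch that is a small base-R helper that takes the word length as a parameter. The name `top_alphagrams` and the toy list are assumptions for illustration; the toy list only contains words mentioned in this post, so it does not reproduce the SPARE result from the full dictionary.

```r
# Hypothetical helper: anagram sets of a given length, largest first
top_alphagrams = function(word_list, word_length){
  kept = word_list[nchar(word_list) == word_length]
  alphagrams = sapply(kept, function(w) paste(sort(strsplit(w, '')[[1]]), collapse = ''),
                      USE.NAMES = FALSE)
  sets = split(kept, alphagrams)              # words grouped by alphagram
  sizes = lengths(sets)                       # size of each anagram set
  ord = order(sizes, decreasing = TRUE)
  data.frame(alphagram = names(sets)[ord],
             n = unname(sizes[ord]),
             anagrams = sapply(sets[ord], paste, collapse = ', '),
             row.names = NULL)
}

# Toy list built from words named in this post (not the full enable1.txt)
toy_words = c('least', 'tales', 'steal', 'slate', 'tesla', 'stale',
              'spare', 'apres', 'stop', 'post')

five_letter_sets = top_alphagrams(toy_words, 5)
print(five_letter_sets)
```

On the real data, passing `words$word` and `5` should give the five-letter analogue of the four-letter table above.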

One thought on “STOP POTS POST. SPOT OPTS TOPS.”

Julian O. (@Julian_O) says:

February 28, 2021 at 8:24 pm

Recently I found that taking the intersection of a Scrabble dictionary (carefully curated, includes unrecognisably obscure words) and Norvig’s 1/3 million word collection (uncurated, included gibberish non-words) gave me a dictionary of about 90,000 words that feels “fair”. It has words that, even if you don’t know them, you feel that it is plausible that you could have known them.

Published by Michael Lugo
