John D Cook wrote, following the latest episode of Kevin Stroud’s History of English podcast, that if a consonant at the end of a word is doubled, it’s probably S, L, F, or Z. It appears that some elementary school teachers teach a “doubled final consonant rule” that is roughly the converse of this: “If a short vowel word or syllable ends with the /f/, /l/, /s/, or /z/ sound, it usually gets a double f, l, s, or z at the end.”
That sounds right to me. Some other consonants seem doubleable but relatively rarely – G, N, and R came to mind, although the only words I could think of that actually end in those doubled are “egg”, “inn”, and “err” respectively. Those are probably instances of the three-letter rule, whereby “content words” tend to have at least three letters. (Walmart has a store-brand of electronics onn. (sic), as well.)
Cook writes: “These stats simply count words; I suspect the results would be different if the words were weighted by frequency.” Challenge accepted. Peter Norvig has some useful data, including “The 1/3 million most frequent words, all lowercase, with counts.”
library(tidyverse)  # for read_delim, str_sub, and the pipe chain below

counts <- read_delim('https://norvig.com/ngrams/count_1w.txt',
                     delim = '\t', col_names = c('word', 'count'))

counts %>%
  mutate(ult = str_sub(word, -1), penult = str_sub(word, -2, -2)) %>%
  filter(!(ult %in% c('a', 'e', 'i', 'o', 'u'))) %>%
  mutate(double = (ult == penult)) %>%
  group_by(ult) %>%
  summarize(double = sum(double * count), all = sum(count)) %>%
  mutate(pct_double = double / all * 100) %>%
  arrange(desc(pct_double))
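For readers who don't run R, the same weighted computation can be sketched in Python with only the standard library. The word counts below are made up for illustration; the real input is Norvig's tab-separated count_1w.txt, one "word&lt;TAB&gt;count" pair per line.

```python
from collections import defaultdict

# Toy (word, count) pairs standing in for Norvig's count_1w.txt.
counts = [("all", 100), ("will", 80), ("cat", 50), ("egg", 10),
          ("class", 40), ("gas", 30), ("off", 20), ("of", 200)]

double = defaultdict(int)  # weighted count of words ending in a doubled letter
total = defaultdict(int)   # weighted count of all words ending in that letter

for word, n in counts:
    last = word[-1]
    if last in "aeiou" or len(word) < 2:  # skip final vowels, as in the R code
        continue
    total[last] += n
    if word[-2] == last:
        double[last] += n

# Frequency-weighted percentage of words ending in each doubled consonant.
pct = {c: 100 * double[c] / total[c] for c in total}
```

With these toy counts, "off" versus the far more frequent "of" is what drags down the doubled-F percentage, which is exactly the kind of frequency weighting Cook suspected would change the picture.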
The result is as follows; the counts in the “double” and “all” columns are numbers of words out of the trillion-word data set. So more than one out of every four words ending in L ends in double-L, while only 0.02% of words ending in Y end in double-Y. Norvig’s word counts come from the 2006 trillion-word data set compiled by Alex Franz and Thorsten Brants of the Google Machine Translation Team from public web pages.
# A tibble: 21 × 4
   ult       double          all pct_double
   <chr>      <dbl>        <dbl>      <dbl>
 1 l     6881678966  24910774907    27.6
 2 z       49070005    684008846     7.17
 3 s     3928825545  84468484631     4.65
 4 f      734823669  16788706705     4.38
 5 x       85268118   3108416171     2.74
 6 c      128294648   6294874635     2.04
 7 j        7948601    390085263     2.04
 8 b       49690347   2690064550     1.85
 9 p       98252923   6905211199     1.42
10 d      460147985  47335849371     0.972
11 m       95423709  11371100313     0.839
12 w       52402066   6908722748     0.758
13 q        2645593    456516943     0.580
14 t      295347877  52262740152     0.565
15 n      238685552  49492910349     0.482
16 v        3478513   1084201734     0.321
17 g       51208927  19948325553     0.257
18 r       58629294  38947533393     0.151
19 k        8943711   8602400357     0.104
20 h       13449679  14180781466     0.0948
21 y        7451329  35763181677     0.0208
The count for “L” is so high because of words like “all” and “will”. It turns out that, in this corpus, my intuition about G, N, and R being plausible doublings was spectacularly wrong. We can also get the five most common words ending in each double letter:
counts %>%
  mutate(ult = str_sub(word, -1), penult = str_sub(word, -2, -2)) %>%
  filter(!(ult %in% c('a', 'e', 'i', 'o', 'u'))) %>%
  mutate(double = (ult == penult)) %>%
  filter(double) %>%
  group_by(ult) %>%
  mutate(rk = rank(-count)) %>%
  filter(rk <= 5) %>%
  summarize(top5 = paste(word, collapse = ', ')) %>%
  print(n = 21)
giving the following table:
# A tibble: 21 × 2
   ult   top5
   <chr> <chr>
 1 b     bb, phpbb, webb, bbb, cobb
 2 c     cc, acc, gcc, fcc, icc
 3 d     add, dd, odd, todd, hdd
 4 f     off, staff, stuff, diff, jeff
 5 g     egg, gg, digg, ogg, dogg
 6 h     hh, ahh, ahhh, ohh, hhh
 7 j     jj, hajj, jjj, bjj, jjjj
 8 k     kk, dkk, skk, fkk, kkk
 9 l     all, will, well, full, small
10 m     mm, comm, hmm, hmmm, dimm
11 n     inn, ann, lynn, nn, penn
12 p     pp, app, ppp, supp, spp
13 q     qq, qqqq, qqq, sqq, haqq
14 r     rr, err, carr, starr, corr
15 s     business, address, access, class, press
16 t     scott, matt, butt, tt, hewlett
17 v     vv, vvv, rvv, vvvv, cvv
18 w     www, ww, aww, libwww, awww
19 x     xxx, xx, xnxx, xxxx, vioxx
20 y     yy, yyyy, nyy, yyy, abbyy
21 z     jazz, buzz, zz, jizz, azz
and S, L, F, and Z do emerge as the only letters for which the resulting words aren’t just junk. The rule seems to loosen for names (Matt, Scott, Hewlett; Ann, Lynn, Penn; Cobb). The data set betrays its internet origins here: phpbb and libwww are relatively common.
I’ve found the pattern at the end of the code block above,
group_by(ult) %>%
  mutate(rk = rank(-count)) %>%
  filter(rk <= 5) %>%
  summarize(paste(word, collapse = ', '))
useful in practice for giving a quick summary of a table where there are many possible values of a second variable for each value of a first variable, and we just want to show which are most common.
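The same top-k-per-group idea carries over to other languages. Here is a minimal Python sketch using only the standard library; the rows are made-up counts for illustration, with the group in the first position, the item in the second, and its count in the third.

```python
from collections import defaultdict

# Hypothetical (group, item, count) rows, like (final letter, word, frequency).
rows = [("l", "all", 100), ("l", "will", 80), ("l", "well", 60),
        ("s", "class", 40), ("s", "press", 30)]

by_group = defaultdict(list)
for group, item, n in rows:
    by_group[group].append((n, item))

# For each group, keep the top-2 items by count and join them into one string,
# mirroring group_by %>% rank %>% filter %>% summarize(paste(...)).
top2 = {g: ", ".join(item for n, item in sorted(pairs, reverse=True)[:2])
        for g, pairs in by_group.items()}
```

The result is one short row per group, which is exactly the compact summary the R pattern produces.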