Ben Zimmer writes a column for the New York Times, “On Language”. His June 25, 2010 column was entitled Ghoti. It’s not about beards. That’s not a misspelling of “goatee”. Rather, it’s a misspelling of “fish” (the “gh” of “enough”, the “o” of “women”, and the “ti” of “action”) that’s traditionally attributed to George Bernard Shaw.
In this column we learn about the absurd respellings that Alexander Ellis, a mid-nineteenth-century spelling reformer, came up with, and about some calculations he did. He thought “scissors” should be spelled “sizerz” (okay, that’s not bad, although how would you spell “sizers”, as in “people who size”?), but at least it’s not spelled “schiesourrhce” (“combining parts of SCHism, sIEve, aS, honOUr, myRRH and sacrifiCE.”).
And Ellis gave three different numbers for the number of possible spellings of “scissors”: 1745226, 58366440, and 81997920. In the interest of trying to guess where these came from, the first thing that comes to mind is finding the prime factorizations. Why? Well, say someone told us “there are twelve ways to spell cat”. We’d logically think that they’d come up with, say, three ways to spell the first sound of that word (“c”, “k”, and “ck”), two ways to spell the second sound (“a” and “ah”), and two ways to spell the third sound (“t” and “tt”), for a total of 3 × 2 × 2 = 12 spellings:
cat, catt, caht, cahtt, kat, katt, kaht, kahtt, ckat, ckatt, ckaht, ckahtt
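The count is just the size of a Cartesian product, which is easy to check mechanically. Here is a minimal Python sketch; the variable names are made up for illustration, and the candidate spellings for each sound are the ones listed above.

```python
from itertools import product

# Candidate spellings for each of the three sounds in "cat".
first = ["c", "k", "ck"]
second = ["a", "ah"]
third = ["t", "tt"]

# One spelling per way of picking one option for each sound.
spellings = ["".join(parts) for parts in product(first, second, third)]

print(len(spellings))        # 3 * 2 * 2 options in total
print(", ".join(spellings))
```

The number of spellings is the product of the number of options per sound, which is why the prime factorization of a claimed count is a reasonable first thing to look at.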
Of course English doesn’t work that way: you can spell the first sound of “cat” as “ck”, but not at the beginning of a word! Zimmer tells us that Ellis acknowledged this. But if you assume the calculation was done this way, then twelve is an easy number to get, while eleven and thirteen are less likely, being primes. The numbers obtained in this way should be products of relatively small numbers, and therefore shouldn’t have large prime factors. And indeed we get
1745226 = 2 × 3^8 × 7 × 19
58366440 = 2^3 × 3^3 × 5 × 11 × 17^3
81997920 = 2^5 × 3^6 × 5 × 19 × 37
and these could conceivably be products of six relatively small numbers. For example:
1745226 = 9 × 9 × 14 × 9 × 9 × 19
58366440 = 20 × 18 × 17 × 17 × 17 × 33
81997920 = 20 × 19 × 24 × 27 × 9 × 37
Where did I get these from? Let’s consider how I went from 20 × 18 × 162129 to 20 × 18 × 17 × 9537 in my decomposition of 58366440. I’ve already written 58366440 = 20 × 18 × 162129. I know I’m going to have to write 162129 as a product of four numbers, so they’re going to be near 162129^(1/4) ≈ 20.07. It turns out that 162129/17 is an integer, namely 9537, and no factor of 162129 is closer to its fourth root than 17 is. (That is, 18, 19, 20, 21, 22, and 23 are not factors of 162129.)
This is a greedy algorithm, and these aren’t optimal decompositions in the sense of having the smallest sum. For example in the last one I could replace 24 and 9, which multiply to 216, with 18 and 12, which have the same product but a smaller sum. But there’s no reason to expect that Ellis’ products had this property anyway; some sounds can be spelled in more ways than others. In particular the last one of these is unlikely to be what Ellis came up with, because the word “scissors” has two of the same sound, so I’d expect two of the factors to be the same. But what do you want from a greedy algorithm?
By the way, it’s not terribly hard to write down rules for going from spelling to pronunciation that work reasonably well. It seems like the same should be true of the reverse.
I’m looking for a job! See my linkedin profile.
Long ago, I set up a LinkedIn profile. I didn’t find LinkedIn useful, and attempted to delete it. It didn’t get deleted, but I cannot get back on. (And so many people have asked me to, that I do wish I could.) Anyway, I don’t know of any jobs right now, but I would have glanced at your profile – and I can’t.
Thanks anyway, Sue. (I’ve actually heard a lot of complaints about linkedin today, because there was a hacking incident where a lot of people’s passwords were stolen.)
I reimplemented a sounds-like system (word -> phonemes -> words) recently, details and source download here: http://shape-of-code.coding-guidelines.com/2012/03/16/generating-sounds-like-and-accented-words/
Good luck with the job hunt and I hope you continue to blog.