Four points is not enough – the enumerative version

As I observed last week, four points is not enough to be sure of advancing from one’s group in the World Cup. With four points (a win, a loss, and a draw) you have roughly a 50% chance of advancing to the knockout stage, based on historical data.

We can also verify this by working out all the possible results of a group. There are six games in each group, so 3^6 = 729 possibilities. If we weight each of these possibilities equally, it amounts to assuming that each game is a win for A, a draw, or a loss for A with equal probability. I wouldn’t want to do this by hand, but by computer it’s easy enough. As usual, using dplyr:

library(dplyr)

#flip(x) turns one team's points from a game into the opponent's points:
#flip(0) = 3, flip(1) = 1, flip(3) = 0
flip = function(x){(6-5*x+x^2)/2}

#column xy is the number of points that team x gets in the game between x and y
#a, b, c, d: total number of points for each team
#a_place: place of team A (the best place among the teams tied with A)
#a_tie: number of teams with same number of points as A (including A)
#p_advance: probability that A advances (top two), ties broken at random
#p_place1: probability that A is in first place
#p_place2: probability that A is in second place; it appears in the output
#          below, and is assumed here to be p_advance - p_place1
groups = expand.grid(ab = c(0,1,3), ac = c(0,1,3),
                     ad = c(0,1,3), bc = c(0,1,3),
                     bd = c(0,1,3), cd = c(0,1,3)) %>%
  mutate(a = ab + ac + ad,
         b = flip(ab) + bc + bd,
         c = flip(ac) + flip(bc) + cd,
         d = flip(ad) + flip(bd) + flip(cd)) %>%
  mutate(a_place = 4 - ((a >= b) + (a >= c) + (a >= d)),
         a_tie = 1 + (a==b) + (a==c) + (a==d),
         p_advance = ifelse(a_place >= 3, 0,
                            ifelse(a_place + a_tie <= 3, 1, (3-a_place)/(a_tie))),
         p_place1 = ifelse(a_place >= 2, 0,
                           ifelse(a_place + a_tie <= 2, 1, (2-a_place)/(a_tie))),
         p_place2 = p_advance - p_place1
         )

The data frame groups has 729 rows, one for each possible outcome of the six games in the group. See the example below, where A, B, C, and D have 4, 4, 3, and 5 points respectively. One way to get this is in the first row:

  • A loses to B, A defeats C, A and D draw – 4 points for A
  • (B defeats A), B loses to C, B and D draw – 4 points for B
  • (C loses to A, C defeats B), C loses to D – 3 points for C
  • (D and A draw, D and B draw, D defeats C) – 5 points for D

and the other is in the second, which is the same with A and B interchanged.

groups %>% filter(a==4, b==4, c==3, d==5)
  ab ac ad bc bd cd a b c d a_place a_tie p_advance p_place1 p_place2
1  0  3  1  0  1  0 4 4 3 5       2     2       0.5        0      0.5
2  3  0  1  3  1  0 4 4 3 5       2     2       0.5        0      0.5

In each of these cases team A is in a two-way tie (a_tie) for second place (a_place); if ties are broken at random, then team A has probability 0.5 of advancing, all coming from second place. Of course ties aren’t broken at random, but I’m not going to model goal differential.
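
The random tie-breaking rule can be written as a small standalone function. This is only a sketch in base R: the name p_top_k and its arguments are hypothetical, and it restates the nested ifelse logic above for a general cutoff k (k = 2 for advancing, k = 1 for winning the group).

```r
# Probability of finishing in the top k when a team sits in a `tie`-way tie
# whose best position is `place`, with ties broken uniformly at random.
# (Hypothetical helper; mirrors the ifelse logic used for p_advance/p_place1.)
p_top_k <- function(place, tie, k) {
  if (place > k) return(0)               # the whole tie is below the cutoff
  if (place + tie - 1 <= k) return(1)    # the whole tie is above the cutoff
  (k - place + 1) / tie                  # remaining spots shared among tied teams
}

p_top_k(place = 2, tie = 2, k = 2)  # two-way tie for second, advancing: 0.5
p_top_k(place = 2, tie = 2, k = 1)  # same tie, winning the group: 0
p_top_k(place = 1, tie = 4, k = 2)  # four-way tie for first, advancing: 0.5
```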

Then we can compute the probability of advancing with each possible point total by aggregation:

 groups %>% group_by(a) %>% summarize(prob = n()/3^6, prob_advance = mean(p_advance), prob_place1 = mean(p_place1))
# A tibble: 9 × 4
      a   prob prob_advance prob_place1
  <dbl>  <dbl>        <dbl>       <dbl>
1     0 0.0370       0          0      
2     1 0.111        0          0      
3     2 0.111        0.0123     0      
4     3 0.148        0.0787     0.00231
5     4 0.222        0.543      0.0216 
6     5 0.111        0.988      0.457  
7     6 0.111        0.975      0.469  
8     7 0.111        1          0.944  
9     9 0.0370       1          1     

To be sure of advancing you need 7 points; 5 or 6 will do except in freak cases. To win the group for sure you need 9 points, but 7 will almost always do (94%); 5 or 6 is roughly a 50-50 shot. And we can plot it:
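
The figure itself isn’t reproduced here. As a stand-in, here is a minimal sketch using base R graphics (an assumption; it is not necessarily how the original plot was made), with the values copied from the table above. The data frame name adv is hypothetical.

```r
# Values copied from the summary table above
adv <- data.frame(
  points       = c(0, 1, 2, 3, 4, 5, 6, 7, 9),
  prob_advance = c(0, 0, 0.0123, 0.0787, 0.543, 0.988, 0.975, 1, 1),
  prob_place1  = c(0, 0, 0, 0.00231, 0.0216, 0.457, 0.469, 0.944, 1)
)

# Solid line: probability of advancing; dashed line: probability of winning the group
plot(adv$points, adv$prob_advance, type = "b", ylim = c(0, 1),
     xlab = "points", ylab = "probability")
lines(adv$points, adv$prob_place1, type = "b", lty = 2)
legend("topleft", legend = c("advance", "win group"), lty = c(1, 2))
```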

This reproduces what Greg Stoll found in 2014.

It’s natural to zoom in on the surprises:

  • how to advance with two points. Here you want a group with scores 9-2-2-2 – one team wins against the other three (including you), those three trade draws, and you win the tiebreaker, meaning you lost your game to the 9-pointer by the fewest goals.
  • how to win your group with three points. All six games must be draws, then you win the tiebreaker. (The first tiebreaker is goal difference, which would obviously be zero for all teams; the second is goals scored)
  • how to fail to advance with five points. This requires a group with scores 5-5-5-0 – one team loses all three of its games, the other three trade draws, and you lose the tiebreaker, meaning you win your game with the 0-pointer by the fewest goals. This is the reverse of the 9-2-2-2 group above.
  • how to fail to advance with six points. This requires a group with scores 6-6-6-0 – like the 5-5-5-0 group, except the three leading teams form a cycle of wins.
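
These four scenarios can be checked by brute force. The sketch below redoes the enumeration in base R so that it stands alone; the columns ahead, level, and totals and the result names are hypothetical helpers, not part of the code above.

```r
# Enumerate all 3^6 outcomes; column xy = points team x takes from the x-y game
pts <- c(0, 1, 3)
g <- expand.grid(ab = pts, ac = pts, ad = pts, bc = pts, bd = pts, cd = pts)
flip <- function(x) (6 - 5 * x + x^2) / 2   # opponent's points: 0 -> 3, 1 -> 1, 3 -> 0

g$a <- g$ab + g$ac + g$ad
g$b <- flip(g$ab) + g$bc + g$bd
g$c <- flip(g$ac) + flip(g$bc) + g$cd
g$d <- flip(g$ad) + flip(g$bd) + flip(g$cd)

# Number of teams strictly ahead of A, and number level with A (excluding A)
g$ahead <- (g$b > g$a) + (g$c > g$a) + (g$d > g$a)
g$level <- (g$b == g$a) + (g$c == g$a) + (g$d == g$a)

# Sorted point totals for each outcome, as a label such as "9-2-2-2"
g$totals <- apply(g[, c("a", "b", "c", "d")], 1,
                  function(r) paste(sort(r, decreasing = TRUE), collapse = "-"))

can_advance <- g$ahead < 2             # fewer than two teams strictly ahead of A
can_miss    <- g$ahead + g$level >= 2  # at least two teams level with or ahead of A

advance_with_2 <- unique(g$totals[g$a == 2 & can_advance])
win_with_3     <- unique(g$totals[g$a == 3 & g$ahead == 0])
miss_with_5    <- unique(g$totals[g$a == 5 & can_miss])
miss_with_6    <- unique(g$totals[g$a == 6 & can_miss])
```

Each of the four results should contain a single pattern of point totals, matching the bullets above.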

The first three have never happened in the World Cup; as I mentioned in my last post, the last one happened twice, both times in 1994.

If you want to know what probability a given team actually has of winning, see FiveThirtyEight. For the scenarios that cause it (including tiebreakers), see the NYT’s Upshot. The simplest scenario is that for the United States – if the US beats Iran today, they advance, otherwise they do not.

Press will buzz off, or, doubled letters at the ends of words

John D Cook wrote, following the latest episode of Kevin Stroud’s History of English podcast, that if a consonant at the end of a word is doubled, it’s probably S, L, F, or Z. It appears that some elementary school teachers teach a “doubled final consonant rule” that is roughly the converse of this: “If a short vowel word or syllable ends with the /f/, /l/, /s/, or /z/ sound, it usually gets a double f, l, s, or z at the end.”

That sounds right to me. Some other consonants seem doubleable but relatively rarely – G, N, and R came to mind, although the only words I could think of that actually end in those doubled are “egg”, “inn”, and “err” respectively. Those are probably instances of the three-letter rule, whereby “content words” tend to have at least three letters. (Walmart has a store-brand of electronics onn. (sic), as well.)

Cook writes: “These stats simply count words; I suspect the results would be different if the words were weighted by frequency.” Challenge accepted. Peter Norvig has some useful data, including “The 1/3 million most frequent words, all lowercase, with counts.”

library(tidyverse)  # read_delim (readr), str_sub (stringr), and the dplyr verbs

counts = read_delim('https://norvig.com/ngrams/count_1w.txt', delim = '\t', col_names = c('word', 'count'))

counts %>% mutate(ult = str_sub(word, -1),
                  penult = str_sub(word, -2, -2)) %>%
  filter(!(ult %in% c('a', 'e', 'i', 'o', 'u'))) %>%
  mutate(double = (ult == penult)) %>%
  group_by(ult) %>% 
  summarize(double = sum(double*count), all = sum(count)) %>%
  mutate(pct_double = double/all * 100)

The result is as follows; the counts in columns “double” and “all” are numbers of words out of the “trillion-word” data set. So over one out of every four words ending in L ends in double-L; only 0.02% of words ending in Y end in double-Y. Norvig’s word counts come from the 2006 trillion-word data set, drawn from public web pages, from Alex Franz and Thorsten Brants at the Google Machine Translation Team.

# A tibble: 21 × 4
   ult       double         all pct_double
   <chr>      <dbl>       <dbl>      <dbl>
 1 l     6881678966 24910774907    27.6   
 2 z       49070005   684008846     7.17  
 3 s     3928825545 84468484631     4.65  
 4 f      734823669 16788706705     4.38  
 5 x       85268118  3108416171     2.74  
 6 c      128294648  6294874635     2.04  
 7 j        7948601   390085263     2.04  
 8 b       49690347  2690064550     1.85  
 9 p       98252923  6905211199     1.42  
10 d      460147985 47335849371     0.972 
11 m       95423709 11371100313     0.839 
12 w       52402066  6908722748     0.758 
13 q        2645593   456516943     0.580 
14 t      295347877 52262740152     0.565 
15 n      238685552 49492910349     0.482 
16 v        3478513  1084201734     0.321 
17 g       51208927 19948325553     0.257 
18 r       58629294 38947533393     0.151 
19 k        8943711  8602400357     0.104 
20 h       13449679 14180781466     0.0948
21 y        7451329 35763181677     0.0208
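
To see what the frequency weighting does, here is a small sketch of the same computation in base R. The words and counts are made up purely for illustration (they are not from Norvig’s file), and the variable names are hypothetical.

```r
# Made-up counts for illustration only (not from Norvig's file)
toy <- data.frame(
  word  = c("all", "will", "tool", "jazz", "quiz", "press"),
  count = c(50, 30, 20, 4, 6, 10),
  stringsAsFactors = FALSE
)

n <- nchar(toy$word)
toy$ult    <- substr(toy$word, n, n)          # last letter
toy$penult <- substr(toy$word, n - 1, n - 1)  # second-to-last letter
toy$double <- toy$ult == toy$penult           # TRUE when the final letter is doubled

# Frequency-weighted totals per final letter
all_count    <- tapply(toy$count, toy$ult, sum)
double_count <- tapply(toy$count * toy$double, toy$ult, sum)
pct_double   <- 100 * double_count / all_count
pct_double
```

Here “tool” counts toward the total for L but not toward the doubled total, so the toy percentage for L is 80 rather than 100.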

The share for “L” is so high because of words like “all” and “will”. It turns out that in this corpus my intuition about G, N, and R being plausible is spectacularly wrong. We can also get the five most common words ending in each double letter:

counts %>% mutate(ult = str_sub(word, -1),
                  penult = str_sub(word, -2, -2)) %>%
  filter(!(ult %in% c('a', 'e', 'i', 'o', 'u'))) %>%
  mutate(double = (ult == penult)) %>% filter(double) %>%
  group_by(ult) %>% mutate(rk = rank(-count)) %>% filter(rk <= 5) %>%
  summarize(top5 = paste(word, collapse = ', ')) %>%
  print(n = 21)

giving the following table:

# A tibble: 21 × 2
   ult   top5                                   
   <chr> <chr>                                  
 1 b     bb, phpbb, webb, bbb, cobb             
 2 c     cc, acc, gcc, fcc, icc                 
 3 d     add, dd, odd, todd, hdd                
 4 f     off, staff, stuff, diff, jeff          
 5 g     egg, gg, digg, ogg, dogg               
 6 h     hh, ahh, ahhh, ohh, hhh                
 7 j     jj, hajj, jjj, bjj, jjjj               
 8 k     kk, dkk, skk, fkk, kkk                 
 9 l     all, will, well, full, small           
10 m     mm, comm, hmm, hmmm, dimm              
11 n     inn, ann, lynn, nn, penn               
12 p     pp, app, ppp, supp, spp                
13 q     qq, qqqq, qqq, sqq, haqq               
14 r     rr, err, carr, starr, corr             
15 s     business, address, access, class, press
16 t     scott, matt, butt, tt, hewlett         
17 v     vv, vvv, rvv, vvvv, cvv                
18 w     www, ww, aww, libwww, awww             
19 x     xxx, xx, xnxx, xxxx, vioxx             
20 y     yy, yyyy, nyy, yyy, abbyy              
21 z     jazz, buzz, zz, jizz, azz  

and S, L, F, and Z do emerge as the only letters where the resulting words aren’t just junk. The rule seems to loosen for names (Matt, Scott, Hewlett; Ann, Lynn, Penn; Cobb). The data set betrays its internet origins here, with phpbb and libwww being relatively common.

I’ve found the pattern at the end of the code block above,

group_by(ult) %>% mutate(rk = rank(-count)) %>% filter(rk <= 5) %>% summarize(top5 = paste(word, collapse = ', '))

useful in practice for giving a quick summary of a table where there are many possible values of a second variable for each value of a first variable, and we just want to show which are most common.
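
As a minimal sketch of that pattern on its own, assuming dplyr is installed: the toy data frame, its made-up counts, and the column name top are hypothetical, and the cutoff is 2 rather than 5 to keep the example small.

```r
library(dplyr)

# Hypothetical toy data: first variable `ult`, second variable `word`, made-up counts
toy <- tibble(
  ult   = c("l", "l", "l", "z", "z", "z"),
  word  = c("all", "will", "tell", "jazz", "buzz", "fizz"),
  count = c(50, 30, 5, 8, 6, 1)
)

# Keep the two most common words per final letter and collapse them into one string
top2 <- toy %>%
  group_by(ult) %>%
  mutate(rk = rank(-count)) %>%
  filter(rk <= 2) %>%
  summarize(top = paste(word, collapse = ", "))
top2
```

The words come out in the order they appear in the input, so this relies on the rows already being sorted by count within each group, as they are in Norvig’s file.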

Four points is not enough

The World Cup just started. As you may know, there are 32 teams, in eight groups of four. Each team in a group plays three games, one against each of the other teams in the group. The top two teams in each group advance to the “knockout round” of 16; teams get three points for a win, one point for a draw, and no points for a loss, so the most points a team can have is 9, the least is 0. (It’s not possible to get 8 points, but every other number is possible.)
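
The claim about 8 points is quick to check by listing every combination of three results. This is a small sketch in base R; the variable names are hypothetical.

```r
# Each of a team's three games is worth 0, 1, or 3 points
per_game <- c(0, 1, 3)
results  <- expand.grid(g1 = per_game, g2 = per_game, g3 = per_game)

# Distinct totals over the 27 combinations
possible <- sort(unique(rowSums(results)))
possible
```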

So how many points is enough to advance?

With a quick Google I found this article from the Hindustan Times in 2018, saying that traditionally people think four points is enough, but in practice, 17 out of 33 teams with four points were in the top two in their group between 1994 and 2014. (By my count it’s 18 out of 35, which doesn’t materially impact the conclusion.) “Three points for a win” was introduced in the 1994 World Cup, so this is as far back as it’s meaningful to go.

Similarly, in the run-up to the current tournament, Fox Sports Australia writes: “Four points is really that magical mark that they need to aim at. You can miss the top two of your group with four points – 10 teams have done it across the last four World Cups – but the overwhelming majority of teams that reach that figure make it out.”

But if you stop to think about it for a moment, a win, a loss, and a draw (which is the only way to get four points) is a middling result, and you need to be in the top half to advance… this is best illustrated by 1994 group E, where all four teams got four points, and of course only two advanced.

And in Slate a couple days ago, Eric Betts wrote: “One win, one draw and one disheartening loss might be enough to get to the knockout rounds, but not necessarily. Two teams with four points advanced in 2018 and one—Iran—was sent home.” (There’s a slight error here – Argentina and Japan advanced with four points, Iran and Senegal didn’t.)

But if we go through and tabulate all 54 groups in the 1994 through 2018 World Cups (six groups in 1994, eight in each of 1998-2018) we really see that four points is not enough. Here’s a table of the teams by their rank in group and their number of points.

There’s an interesting anomaly here. Teams have failed to make the top two with six points – in 1994 both group D and group F had the point totals 6-6-6-0, with one team that was beaten by each of the other three, while the other three had a “cycle” of wins. (The third-place team in each of those groups advanced to the knockout round – in 1994 there were only six groups, as opposed to the current eight, so the top four third-place teams also advanced to the knockout round.) But no team has ever failed to advance with 5. (This is possible – a group where team A loses to B, C, and D, and all three games among B, C, D are ties, would have point totals 5-5-5-0.)

So you need five points to advance (unless something weird happens). Four is not enough.

When do the polls close for most people?

I tried to Google this and couldn’t – how many people live in places where the polls close tonight at time X, for each possible value of X? You can easily find a map of closing times, for example at 270towin.com:

But how many people do each of those colors represent?

The Green Papers has a nice table of poll closing times. The wrinkle is that in a state with multiple time zones, there are two possibilities:
  • polls close at the same clock time across the state. For example, in Florida, polls close at 7:00 PM everywhere in the state… as we learned in 2000 when it was called for Democrats before polls in the Panhandle (the westernmost part of the state, the only part on Central time, and a heavily Republican area) had closed. This is the method in Alaska, Florida, Idaho, Kansas, Kentucky, Michigan, North Dakota, South Dakota, and Texas.
  • polls close at the same real time across the state. This is what Nebraska (8 Central / 7 Mountain) and Tennessee (8 Eastern / 7 Central) do.
We need to make some assumptions about what proportion of each split state is in each time zone. Fortunately, someone named “segacs” made a Sporcle quiz in 2015 which counted the population in each time zone worldwide, and broke down the results in a Google spreadsheet. We can just extrapolate those results forward – Tennessee was 34% Eastern time and 66% Central according to 2015 Census data, so we’ll just carry that forward to 2020. If everybody’s been leaving Knoxville and Chattanooga to move to Nashville and Memphis, we won’t know.
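
To make the two cases concrete, here is a minimal sketch in base R that converts a local closing hour to Eastern time. The function to_eastern and the offsets table are hypothetical names for illustration, and the offsets assume standard time, as in November.

```r
# Hours behind Eastern for each zone (standard time, as in November)
hours_behind_eastern <- c(Eastern = 0, Central = 1, Mountain = 2,
                          Pacific = 3, Alaska = 4, Aleutian = 5)

# Convert a local closing hour (24-hour clock) to the Eastern-time hour
to_eastern <- function(local_hour, zone) {
  (local_hour + unname(hours_behind_eastern[zone])) %% 24
}

# Same clock time statewide (Florida, 7 PM): two different real times
to_eastern(19, c("Eastern", "Central"))   # 19 20

# Same real time statewide: Tennessee (8 Eastern / 7 Central)
to_eastern(c(20, 19), c("Eastern", "Central"))   # 20 20

# Same real time statewide: Nebraska (8 Central / 7 Mountain)
to_eastern(c(20, 19), c("Central", "Mountain"))  # 21 21
```

Only the states in the first category need a population split by time zone; for Nebraska and Tennessee the whole state lands in a single Eastern-time bucket.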

As it turns out, in most places the polls close at 7 or 8 local time, and those represent about equal numbers of people. The exceptions are:

  • Kentucky and Indiana: 6 pm local
  • North Carolina, Ohio, West Virginia, and Arkansas: 7:30 pm local
  • New York and North Dakota: 9 pm local. (Is there anything else New York and North Dakota have in common?)

The overall distribution is in the chart below.

And in Eastern-time terms, the distribution is:

Both of these charts and the underlying data are at this Google spreadsheet.

This should be familiar to people who make a habit of watching the election returns roll in… you get the first substantial votes at 7, a big chunk at 8, and they trickle in over the rest of the night. (In presidential years the 11:00 chunk isn’t as interesting as you’d expect from its volume – the only polls closing at 11 are California, Oregon, Washington, and small portions of North Dakota and Idaho, and if any of those states are competitive the election as a whole is not.)

Too small to show on the chart are the polls that close at 1 AM. Those are the polls that close at 8 PM (Hawaii-)Aleutian time (UTC-10, five hours behind Eastern time), in that part of the Aleutian Islands of Alaska west of 169° 30′ W longitude. In terms of populated places it looks like this is a really long-winded way of saying Adak. Adak has 326 people. The biggest settlement in the Aleutians, Unalaska, is only at 166° 32′ W and is therefore in UTC-9, “Alaska time”. Brian Brettschneider, Alaska-based climatologist, called out Adak in 2016:


and at least a cursory look at a list of Alaska polling places suggests there are two in the Aleutians, “Aleutians No. 1” in Adak and “Aleutians No. 2” in Unalaska. It seems quite reasonable that there is only one polling place, the one in Adak, that closes at 1 AM Eastern. This oddity has been mentioned before, in 2012 and 2016, in both cases by local sources. In 2016 five people voted after 8 PM Alaska / 7 PM Aleutian (midnight Eastern).

Not that anything will be called in Alaska when the polls close… Alaska uses ranked-choice voting, so it’ll take a while to count the votes anyway.

Does Game 3 have some special magic?

Saw a “statistic” during game 3 of the World Series yesterday – teams that have won Game 3 have gone on to win 68 times out of 98 (69%).

First of all, 98 is a strange denominator there… that would be “since 1923”, but the first World Series was played in 1903! But 1922 was the last time there was a tied game in a World Series, so presumably this is 1923 through 2021 (99 years) with the exception of 1994?

You can find this statistic in, for example, this CBS Sports item or this tweet from @MLB.

Okay, turns out I misheard it – it’s the winner of all Game Threes in best-of-seven MLB postseason series when the first two games were split 1-1.

So is this surprising? Not really… the winner of such a Game Three has to win at least two out of the next four to win the series. If games are coin flips, that happens 11/16 of the time (69%). Game Three doesn’t have some special magic… it’s just that a 2-1 lead is substantial.
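
The 11/16 is a binomial tail, which is a one-liner in base R; the variable name is hypothetical, and the coin-flip assumption is the one stated above.

```r
# P(at least 2 wins in the next 4 games), each game a fair coin flip
p_series <- sum(dbinom(2:4, size = 4, prob = 0.5))
p_series   # 0.6875, i.e. 11/16

# Observed share from the broadcast statistic
68 / 98    # about 0.694
```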

This post should be going out at the first pitch of Game 4. Go Phillies!