# Figuring out when a book was written from the names in it

My daughter likes the book Knuffle Bunny Too: A Case of Mistaken Identity, by Mo Willems. Maybe I like it more than she does; she’s old enough to pick out her own books now, and doesn’t pick this one. One day Trixie brings her bunny to school, and it turns out that another child has the same bunny and they become friends! (The children, not the bunnies.)

I’ve read this book enough time that my mind can wander while I read it. There’s a list of names embedded in it, of the other kids that Trixie wants to show the bunny to: Amy, Meg, Margot, Jane, Leela, Rebecca, Noah, Robbie, Toshi, Casey, Conny, Parker, Brian.

So… from this list of names, can we figure out when Trixie was born?

The R package `babynames` is really useful for this kind of question. This is a wrapper around the Social Security Administration’s baby names data, which gives the number of births of people with each name in the US, each year, for names that were given to at least five babies. It goes from the most common baby names of 1880:

``> head(babynames)A tibble: 6 x 5year sex name n prop1 1880 F Mary 7065 0.072383592 1880 F Anna 2604 0.026678963 1880 F Emma 2003 0.020521494 1880 F Elizabeth 1939 0.019865795 1880 F Minnie 1746 0.017888436 1880 F Margaret 1578 0.01616720``

to the least common male baby names of 2017:

``> tail(babynames)A tibble: 6 x 5year sex name n prop1 2017 M Zyhier 5 2.55e-062 2017 M Zykai 5 2.55e-063 2017 M Zykeem 5 2.55e-064 2017 M Zylin 5 2.55e-065 2017 M Zylis 5 2.55e-066 2017 M Zyrie 5 2.55e-06``

Here’s code to generate a list of “typical” names from each year ending in 0:

```set.seed(1)
nms = list()
for (y in seq(1880, 2010, by = 10)){
cat(y, ': ', babynames %>% filter(year == y) %>%
sample_n(size = 10, weight = prop, replace = TRUE) %>%
summarize(y, nms = paste0(sort(name), collapse = ', ')) %>% select(nms) %>% unlist(), collapse = '',
'\n')
}
```

which gives output

1880 : Alice, Amy, Callie, Cora, Ella, George, Izora, John, Ulysses, Will
1890 : Agnes, Ben, Frank, Frank, John, Lottie, Lula, Margaret, Mildred, Onie
1900 : Alice, Charlie, Daniel, Dorothy, Elda, Joseph, Joseph, Mary, Monroe, Walter
1910 : Beatrice, Eulah, Francisco, Hoyt, James, John, Joseph, Mabel, Susie, Sylvester
1920 : Blanche, Concetta, Isom, Jane, Jean, Katherine, Mary, Mary, Presley, William
1930 : Bobbie, Charles, Herbert, James, Mack, Marilyn, Raymond, Robert, Salvatore, William
1940 : Alton, Bobbie, Bobby, Dorothy, Frank, Helen, Jerry, Joanne, John, Robert
1950 : Ann, Donald, Donna, Elizabeth, John, Joseph, Ronald, Shirley, Susan, Thomas
1960 : Colleen, Darlene, Don, Glen, Johnny, Mark, Phyllis, Ronald, Steve, Steven
1970 : Christopher, Jeffrey, Kathy, Leonard, Lorene, Paula, Raymond, Rebecca, Sally, Sarah
1980 : Adam, Christee, Christina, Curt, Garrett, Jacob, Maurice, Melissa, Ruby, Todd
1990 : Brian, Candice, Damien, Gaspar, John, Marina, Nicholas, Sarah, Vicente, Walter
2000 : Annalycia, Chrishauna, Cristian, Daniel, Devon, Elizabeth, Maggie, Payton, Sebastian, William
2010 : Ella, Joshua, Kenlee, Lois, Lucian, Makayla, Monique, Noah, Nyasia, Pete

Note that this isn’t the ten most common names in each year. Some names appear twice in some years (Joseph in 1900, Mary in 1920). Some rare names appear (there were only 200 babies named Nyasia born in 2010), but that’s to be expected in a random sample of names. But if you read through this list, at least if you’re American, you see an evolution from “old-fashioned” names to “normal” names to “names people are giving to their kids that sound totally weird”.

So this might be possible. We can plot the frequency of the names occurring in the relevant passage against time:

```knuffle_bunny_names =  c("Amy", "Meg", "Margot", "Jane", "Leela", "Rebecca", "Noah", "Robbie", "Toshi", "Casey", "Conny", "Parker", "Brian")
mins = babynames %>% group_by(year) %>% summarize(minprop = min(prop))
grid = expand.grid(year = unique(babynames\$year), name = knuffle_bunny_names)
props = grid %>% left_join(babynames) %>% group_by(year, name) %>% summarize(prop = sum(prop)) %>%
left_join(mins) %>% mutate(corrected_prop = ifelse(is.na(prop), minprop * 4/5, prop))
props %>% ggplot() + geom_line(aes(x=year, y=corrected_prop, group = name, color = name)) +
scale_y_log10('name frequency', breaks = c(10^((-6):(-2)), 3*(10^((-6):(-2))))) +
scale_x_continuous('birth year') +
ggtitle('Frequency of names appearing in Knuffle Bunny Too')
```

Note the use of the log scale on the y-axis; without that you just learn that everyone was naming their kids Amy and Brian in the 1970s. Names that don’t occur in the data set for a given year are assumed to occur four times, which is the line along the bottom. The Social Security program wasn’t introduced until 1937, and didn’t originally cover all workers, so data coverage is sparse for births pre-1920 or so. But we already knew Trixie isn’t that old.

The probability that 13 randomly chosen kids born in year y have those particular thirteen names is just 13! times the product of the name frequencies:

```props %>% group_by(year) %>%
summarize(n = n(), total_prob = factorial(13) * exp(sum(log(corrected_prop)))) %>%
ggplot() + geom_line(aes(x=year, y=total_prob)) +
scale_y_log10('probability of name set', breaks = 10^((-46):(-38))) +
ggtitle('Probability that 13 randomly chosen newborns have\nthe names of the children from Knuffle Bunny Too')
```

So this set of names has the largest probability of occurring in 2000, followed by 1996, 1997, and 2003. The “right answer” is 2001, according to a 2016 New York Times profile of Mo Willems, at least if Trixie is meant to be Willems’ actual child. The book was published in 2007, and in it Trixie goes to pre-K (likely a class of ages 4 or 5); the previous book in the series was published in 2004 and in it Trixie couldn’t even “speak words” yet.