Pairs of cities with the same population

I was looking at the list of US cities by population on Wikipedia yesterday, because I noticed that Sunnyvale, a suburb of San Jose that I had occasion to visit, had a surprisingly large population of 140,095. There are a lot of places like this in California: despite having about 12% of the US population, it has 64 of the 275 largest cities (all those with population above 100,000), or about 23%.

And among those 275 cities there are three pairs with the same population in the 2010 Census:

  • Fargo, North Dakota and Norwalk, California, both at 105,549
  • Arvada, Colorado and Ventura, California, both at 106,433
  • Aurora, Illinois and Oxnard, California, both at 197,899

Of course census data shouldn’t actually be taken to be exact. But how many pairs like this would we expect?

The starting point here is Zipf’s law for cities, or the rank-size rule. This rule states that the nth largest city in a region will have population 1/n times that of the largest city. As it turns out, this isn’t quite true for the structure of cities in the US, but they do roughly follow a power law. If we regress log(population) against log(rank), we get the regression line

\log(pop) = 15.6103 - 0.7287 \log(rank)

or, if we exponentiate both sides,

pop = 6018207 \times rank^{-0.7287}
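
As a quick check on the exponentiated form, here is a minimal Python sketch that evaluates the fitted line at a few ranks. It assumes the logs in the regression are natural logs (which is what makes exp(15.6103) come out near 6018207); the names INTERCEPT, SLOPE and predicted_population are mine, not anything from the original analysis.

```python
import math

# Coefficients of the fitted line log(pop) = 15.6103 - 0.7287 log(rank), from the text.
# Assumption: natural logarithms.
INTERCEPT = 15.6103
SLOPE = -0.7287


def predicted_population(rank):
    """Population the fitted power law predicts for the city of the given rank."""
    return math.exp(INTERCEPT + SLOPE * math.log(rank))


# The multiplicative constant a in pop = a * rank^(-0.7287).
a = math.exp(INTERCEPT)
print("a is about", round(a))

for rank in (1, 10, 100):
    print(rank, round(predicted_population(rank)))
```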

For example, we predict that the hundredth-largest city should have population 6018207 \times 100^{-0.7287} = 209926. The actual hundredth-largest city is Spokane, Washington, with population 208916.

[Figure: city size vs. city rank]

Because I don’t want to rewrite these numbers over and over, I’m going to write that as p = a r^{-b}, and plug in the numbers at the end. Now let’s invert this relationship. How many cities do we expect to have population greater than some constant p? That’s just the rank that corresponds to p. Solving for r gives r = (p/a)^{-1/b}. Let’s write this as r = f(p).

The expected number of cities having population exactly p is then

-f^\prime(p) = a^{1/b} {1 \over b} p^{-(1+1/b)}

Taking the derivative here is actually the crux of the analysis, so I’ll elaborate a bit. The expected number of cities having population at least p is f(p); the expected number of cities having population at least p+1 is f(p+1). The expected number of cities having population exactly p, then, is f(p)-f(p+1) = -(f(p+1) - f(p)). But f(p) varies slowly so we can approximate f(p+1) - f(p) by f^\prime(p). Let g(p) = -f^\prime(p) for later ease of notation.
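
To see how little is lost in replacing the difference by the derivative, here is a small Python sketch comparing f(p) - f(p+1) with g(p) at one population. The constants are the fitted values quoted above; the choice of p = 105,000 and the function names are just for illustration.

```python
# Fitted constants from the text: pop = A * rank^(-B).
A = 6018207.0
B = 0.7287


def f(p):
    """Expected number of cities with population at least p (the inverted power law)."""
    return (p / A) ** (-1.0 / B)


def g(p):
    """Minus the derivative of f: expected number of cities with population exactly p."""
    return A ** (1.0 / B) / B * p ** (-(1.0 + 1.0 / B))


p = 105_000
difference = f(p) - f(p + 1)   # the exact one-step difference
derivative = g(p)              # the approximation used in the text
relative_gap = abs(difference - derivative) / derivative

print(difference, derivative, relative_gap)
```

The relative gap is tiny because f changes very little over a single unit of population at this scale.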

Roughly speaking, g(p) is the density of cities per unit population, at p. For example, if we let p = 105,000 we get that we expect 0.0034 cities of population 105,000. Extrapolating to the range from 100,000 to 110,000, we expect 10,000 times this many cities, or 34, in that population range; there are in fact 39.
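
The extrapolation can be sanity-checked by comparing it with the difference of ranks f(100,000) - f(110,000), which needs no density approximation at all. This is only a sketch using the fitted constants from above; the function names are mine.

```python
# Fitted constants from the text: pop = A * rank^(-B).
A = 6018207.0
B = 0.7287


def f(p):
    """Expected number of cities with population at least p."""
    return (p / A) ** (-1.0 / B)


def g(p):
    """Density of cities per unit population at p."""
    return A ** (1.0 / B) / B * p ** (-(1.0 + 1.0 / B))


# Density at the midpoint times the width of the range, as in the text.
by_density = g(105_000) * 10_000

# Difference of the model's ranks at the two ends of the range.
by_ranks = f(100_000) - f(110_000)

print(by_density, by_ranks)
```

Both numbers should land near the 34 quoted above, against an actual count of 39.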

So now take this expected value, and figure that the actual number of cities of population p is a Poisson random variable with mean g(p). The probability that such a random variable is equal to 2 is e^{-g(p)} g(p)^2/2. Since g(p) is very close to 0, I’ll drop the exponential term in what follows. Furthermore for ease of calculation, let’s assume these Poissons are never greater than 2. For example, the probability that a Poisson with mean 0.0034 is at least 2 is exactly

1 - e^{-0.0034} (1 + 0.0034) \approx 5.767 \times 10^{-6}

and I use the approximation 0.0034^2/2 = 5.78 \times 10^{-6}. The number of pairs of cities with population greater than c and the same population is then predicted to be

\sum_{p \ge c} g(p)^2/2

but I’d rather do an integral instead of a sum, so we’ll approximate this as

\int_{c}^\infty g(p)^2/2 \: dp.
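
The Poisson shortcut used a few lines back (dropping the exponential and ignoring counts above 2) is easy to check numerically. A minimal sketch, with the mean 0.0034 taken from the example in the text and function names of my own choosing:

```python
import math


def prob_at_least_two(mean):
    """Exact P(X >= 2) for a Poisson X: one minus P(X = 0) minus P(X = 1)."""
    return 1.0 - math.exp(-mean) * (1.0 + mean)


def pair_approximation(mean):
    """The shortcut mean^2 / 2, i.e. P(X = 2) with the exponential factor dropped."""
    return mean ** 2 / 2.0


mean = 0.0034
exact = prob_at_least_two(mean)
approx = pair_approximation(mean)

print(exact, approx, approx / exact)
```

For means this small the two agree to within a fraction of a percent, so summing mean^2 / 2 over populations costs almost nothing in accuracy.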

Recalling that g(p) = a^{1/b}/b p^{-(1+1/b)}, we get

\int_c^\infty {a^{2/b} \over 2b^2} p^{-(2+2/b)} \: dp

and doing the integral gives

{a^{2/b} \over 2b^2} {b \over b+2} c^{-(1+2/b)}

Plugging in the values from above, c = 100000, a = 6018207, b = 0.7287, gives 0.1924. So the expected number of such coincidences is about one-fifth; in the 2010 census it was three.
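
Here is a sketch of that final plug-in step in Python, together with a brute-force version of the sum that the integral replaced. The cutoff of 1,000,000 in the sum is my own choice (the terms decay fast enough that the tail beyond it is very small), and the function names are illustrative.

```python
# Fitted constants from the text: pop = A * rank^(-B).
A = 6018207.0
B = 0.7287


def g(p):
    """Expected number of cities with population exactly p."""
    return A ** (1.0 / B) / B * p ** (-(1.0 + 1.0 / B))


def expected_pairs(c):
    """Closed form of the integral of g(p)^2 / 2 from c to infinity."""
    return (A ** (2.0 / B) / (2.0 * B ** 2)
            * (B / (B + 2.0))
            * c ** (-(1.0 + 2.0 / B)))


def expected_pairs_by_sum(c, upper=1_000_000):
    """Direct sum of g(p)^2 / 2 for c <= p < upper; the tail past upper is ignored."""
    return sum(g(p) ** 2 / 2.0 for p in range(c, upper))


closed_form = expected_pairs(100_000)
direct_sum = expected_pairs_by_sum(100_000)

print(closed_form, direct_sum)
```

Because expected_pairs takes the threshold c as an argument, the same function can be reused for other cutoffs.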

If you compare data from 2000 the first such coincidence is at rank 467 – Royal Oak, MI and Bristol, CT both had population 60,062 that year. (Note: I scanned the data by eye, so it’s possible I missed something.) You expect to start seeing coincidences this far down; plugging in c = 60000 with the 2010 coefficients gives 1.3. (Properly speaking I should use the 2000 coefficients, but I’d have to compute them first.) So 2010 is probably unusual. Still, I can’t help but suspect that the Census might be fudging the data a little bit to make these cities tie so that the lower-ranked member of each couplet doesn’t complain…

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Weekly links for June 24

Sir Timothy Gowers.

Black bears have some numerical ability.

From Josh Laurito, How similar are European languages to each other? Unfortunately does not include the language(s) that I’ve heard referred to as “BCS” or, somewhat more crudely, “Bosnifuckit”. That’s “Bosnian/Croatian/Serbian”, which are three very similar languages or three dialects of the same language depending on who you ask.

Ownership and control at Square, at Rotary Gallop via Hacker News. Rotary Gallop is applying something like the Banzhaf power index to corporate ownership structures.

DarwinTunes, in which pieces of music make baby music.

Lexicon Valley on stylometry.

William Wu’s gallery of fractals, mostly by Paul DeCelle, and Wu’s introductory comments on fractals and brief explanations of the Mandelbrot set and Sierpinski triangle.

Davantage de régularité dans les naissances ?

Mike Bostock has a series of posts on visualization methods: fisheye distortion and other ways to distort plots so that when you drag the cursor over them the part near the cursor is magnified, the Les Misérables adjacency matrix, hive plots for dependency graphs, chord diagrams for Uber car service data.

Nate Silver on Calculating “House Effects” of Polling Firms.

Mark Dominus explains linear regression in a math.stackexchange answer.

A crowdsourced survey of adjuncts reports that adjuncts don’t make much money. I’m disinclined to trust the exact numbers, but the more general results are sobering.

A product rule for triangular numbers. It turns out that triangular numbers satisfy the rule T(mn) = T(m) T(n) + T(m-1) T(n-1) (and there’s a nice pictorial proof of this); are they the only such sequence?
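
The product rule itself is easy to spot-check by brute force. A small Python sketch, assuming the usual definition T(n) = n(n+1)/2 with T(0) = 0; this only tests the identity for small m and n and says nothing about whether other sequences satisfy it.

```python
def T(n):
    """The n-th triangular number, n(n+1)/2."""
    return n * (n + 1) // 2


def product_rule_holds(m, n):
    """Check T(mn) = T(m) T(n) + T(m-1) T(n-1) for one pair (m, n)."""
    return T(m * n) == T(m) * T(n) + T(m - 1) * T(n - 1)


# Check every pair with 1 <= m, n <= 100.
all_hold = all(product_rule_holds(m, n)
               for m in range(1, 101)
               for n in range(1, 101))
print(all_hold)
```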

Turing centenary coverage by the BBC

Tomorrow is the centenary of Alan Turing’s birth. His biographer Andrew Hodges, author of Alan Turing: The Enigma (just out in a new Centenary Edition), has written a brief piece for the BBC, accompanied by a video of talking heads at this centenary conference. This is part of a larger section of essays about Turing the BBC have been running this week, also accompanied by short videos. These are:

Monday: Vint Cerf, why the tech world’s hero should be a household name
Tuesday: Jack Copeland, The codebreaker who saved ‘millions of lives’
Wednesday: Simon Lavington, is he really the father of computing?
Thursday: Noel Sharkey, the experiment that shaped artificial intelligence
Friday: Andrew Hodges, Gay codebreaker’s defiance keeps memory alive

Tomorrow is also the first day of SF pride, which strikes me as the sort of event that Turing would not have been particularly interested in.

Edited to add, 1:22 pm: San Francisco startup Prior Knowledge made a birthday cake for Turing, which happens to have the colors of the rainbow on it. Their founder, Eric Jonas, says that his “long term goal is to use the Reverend Thomas Bayes to defeat the Reverend Thomas Malthus”, which sounds pretty awesome.

How fast was I going?

About halfway between Charlotte and San Francisco, I found myself staring out the window. Because airplanes don’t exist to amuse me but rather to get their passengers from one place to another as cheaply as possible, there is no in-flight video entertainment system. And if there were an in-flight video entertainment system, it wouldn’t include the channel that tells you where the plane is and how fast it’s going.

Fortunately, if you’ve flown over (or driven through?) this part of the world, you realize that the ground is essentially a giant checkerboard. See for example this image of Kansas crops from Wikipedia. So if you know how big the checkerboard squares are, and you have a stopwatch, you can figure out how fast you’re going. Just hold your head steady and watch how many of the little squares on the ground pass by in a given amount of time. (This is hard if there’s turbulence.)

In my case I observed that we crossed ten such squares heading roughly parallel to the direction of the plane, and one such square heading roughly perpendicular to the direction of the plane, in 34 seconds. I know, from basic geography, that the plane is traveling roughly west. I cover \sqrt{10^2 + 1^2} \approx 10.05 squares every 34 seconds, or \sqrt{10^2+1^2}(3600/34) \approx 1060 squares per hour. (In my head I actually just did 10 \times (3600/34), the extra 1 being basically superfluous at this level of precision.)

But how big are the squares? This is the one piece of knowledge that I couldn’t get from the air. They’re half-mile squares. I had actually thought they were one-mile squares, remnants of the Public Land Survey System — and indeed somewhat west of where I noticed this the squares did turn into one-mile squares before they disappeared completely — but 1060 miles per hour was clearly too fast. The squares had to be some simple fraction of a mile, though, so we were traveling at about 530 miles per hour. Furthermore, for every ten squares moved west we moved one square north; so our heading was about one-tenth of a radian, or six degrees, north of west.
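
The whole back-of-the-envelope calculation fits in a few lines of Python. This is a sketch using the observed counts and the half-mile square size from above; the variable names are mine, and the heading uses the exact arctangent rather than the small-angle shortcut.

```python
import math

# Observations from the text.
squares_along = 10        # squares crossed roughly parallel to the direction of travel
squares_across = 1        # squares crossed roughly perpendicular to it
elapsed_seconds = 34
square_side_miles = 0.5   # the squares turned out to be half-mile squares

# Straight-line distance covered, measured in squares.
squares_crossed = math.hypot(squares_along, squares_across)

# Convert to squares per hour, then to miles per hour.
squares_per_hour = squares_crossed * 3600 / elapsed_seconds
speed_mph = squares_per_hour * square_side_miles

# Angle north of west: one square north for every ten squares west.
heading_degrees = math.degrees(math.atan2(squares_across, squares_along))

print(round(squares_per_hour), round(speed_mph), round(heading_degrees, 1))
```

The speed comes out around 530 miles per hour and the heading a little under six degrees, in line with the estimates above.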

I didn’t note the time exactly, but it was perhaps 5:40 Pacific daylight time when I made this observation, and I’m guessing we were somewhere over southern Kansas. If you look at the flight plan for this flight and plot the appropriate piece of it you can see we would have been flying just north of west at that time; I don’t know how to get the speed from publicly available data.

Weekly links for June 17

Rick Wicklin of SAS gives Eight tips to make your simulation run faster.

Analyzing the chords of 1300 popular songs for patterns, the first blog post of hooktheory, which “teaches the theory behind popular music for songwriters and musicians.” Via Hacker News. If you know some music theory you won’t be surprised, but it’s interesting.

Larry Wasserman on the difference between statistics and machine learning.

Why are no-hitters on the rise recently?

Steven Pinker on the number of people that have lived and died in the 20th century.

Five mathematical subjects that could be taught in elementary school or high school, but aren’t.

Some graphs of number of births by day of year.

Internet Advertising: An Interplay among Advertisers, Online Publishers, Ad Exchanges and Web Users. A 44-page review article, which looks mostly non-technical, by Shuai Yuan, Ahmad Zainal Abidin, Marc Sloan, and Jun Wang. Via the Technology Review blog.

This American Life on Chernoff faces.

Gasarch examined the odds of Kentucky Derby longshots.
