Power laws and wealth

From Alison Griswold at Slate, reporting on the Wealth-X and UBS billionaire census (warning: obnoxious auto-playing music at the second link): “The typical billionaire has a net worth of $3.1 billion.”

Does “typical” mean the mean, or the median? It appears that the mean is intended, because the front page of this census says there are 2,325 billionaires globally with a combined net worth of 7.3 trillion dollars, and the quotient is just about 3.1 billion.

Furthermore, wealth supposedly follows a Pareto distribution – the number of people wealthier than x is proportional to x^{-\alpha}. (Note that this may not actually be true; identifying power laws is tricky in general.) But let’s play along, and observe that:

  • the median of a Pareto distribution is x_m 2^{1/\alpha}. Let x_m = 1 (i. e. measure money in units of billions of dollars) and you get \alpha = \log 2/\log 3.1 \approx 0.61, if “typical” means median.
  • the mean of a Pareto distribution is x_m \alpha/(\alpha-1), so you get \alpha/(\alpha-1) = 3.1, or \alpha = 31/21 \approx 1.48, if “typical” means mean.

These two parameters are very different! In particular, with the parameter \alpha = 1.48 (derived from assuming the mean billionaire has a net worth of 3.1 billion), 81 percent of billionaires have less net worth than what the article calls the “typical” billionaire, and the median billionaire has a net worth of “only” 1.6 billion. In contrast, a Pareto distribution with \alpha \le 1 – that is, any one where the median is at least twice the minimum – doesn’t even have a finite mean. (Of course the actual distribution of billionaire net worths has a well-defined mean, whatever it is, because there are only finitely many of them.)
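Here’s a quick numerical check of these claims (a minimal sketch; the median and mean formulas are the standard ones for a Pareto distribution, with the minimum x_m taken to be 1 billion dollars):

```python
from math import log

x_m = 1.0        # minimum wealth, in billions of dollars
typical = 3.1    # the reported "typical" net worth, in billions

# If "typical" means the median: median = x_m * 2**(1/alpha)
alpha_if_median = log(2) / log(typical / x_m)
print(alpha_if_median)                         # about 0.61

# If "typical" means the mean: mean = x_m * alpha/(alpha - 1), valid only for alpha > 1
alpha_if_mean = typical / (typical - x_m)
print(alpha_if_mean)                           # 31/21, about 1.48

# With alpha = 31/21, the Pareto CDF is F(x) = 1 - (x_m/x)**alpha, so the
# fraction of billionaires below the 3.1 billion "typical" figure is:
print(1 - (x_m / typical) ** alpha_if_mean)    # about 0.81

# ... and the median of that distribution is:
print(x_m * 2 ** (1 / alpha_if_mean))          # about 1.6 (billion)
```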

The original survey also mentions that there’s a “wealth ceiling” around 10 billion USD; see the plot at Quartz. But I don’t see any really clear evidence for this. There could be such a ceiling, though, as a function of the size and growth rate of the world economy, the typical length of human lives, tax rates on the income of the very wealthy, and so on.

DC statehood, 51-star flags, and models of what will pass Congress

Is D. C. Statehood a matter of civil rights?, by Andrew Giambrone in The Atlantic

I know, what does this have to do with math?

Well, you could read Chris Wilson’s article for Slate on Puerto Rico statehood back in 2010, in which he writes about possible flag designs; we’d probably end up going with alternating rows of nine and eight stars, one of the options Skip Garibaldi identified.

But what I’m actually writing about is what I saw when I followed the link in that article to the govtrack.us page on the New Columbia Admission Act. This gives the following prognosis: “64% chance of getting past committee, 17% chance of being enacted.” (Disclaimer: govtrack.us is the work of Joshua Tauberer, who I knew in grad school, in the sense that we had some mutual friends and have been in the same room at the same time.)

It turns out this comes from logistic regression models, trained on the 2011-2013 Congress. The linked page there explains the models and gives a list of the features looked at and their weights. There are models both for getting out of committee and for being enacted. Somewhat amusingly, the feature with the highest positive weight in both of these is “Title starts with ‘To designate the facility of the United States Postal’”, which refers to bills like this one that name post offices. In the particular case of this bill, though, the prognosis comes from more substantial features, having to do with sponsorship and committee membership and the like.
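For concreteness, here’s a minimal sketch of that kind of model; the features and outcomes below are made up for illustration, and are not govtrack’s actual feature set or training data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical bill features: [sponsor is in the majority party,
# a cosponsor sits on the relevant committee, title names a post office].
X = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 1],
])
# Made-up outcomes from a past Congress: did the bill get out of committee?
y = np.array([1, 0, 1, 1, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.coef_)                   # one weight per feature
print(model.predict_proba(X)[:, 1])  # fitted probabilities for these bills
```

A second model of the same form, trained on enactment rather than committee passage, would give the other number in the prognosis.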

Note that the model doesn’t look at the text of the bill. And it need not – we already have sophisticated textual-analysis modules in the guise of Congresspeople and their staffs. By looking at sponsorship data, the model effectively works as an ensemble, combining the judgments of many individual models (the Congresspeople).
govtrack.us also offers analyses of ideology of Congresscritters (based on cosponsorship, using singular value decomposition) and leadership (based on cosponsorship again, using PageRank). As always, it’s good to see these statistical techniques being used to analyze things that matter.
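Here’s a toy version of both analyses, assuming a small 0/1 cosponsorship matrix (legislators by bills) and a made-up list of primary sponsors; this is just my guess at the general shape of the computation, not govtrack’s actual pipeline.

```python
import numpy as np
import networkx as nx

# Toy 0/1 cosponsorship matrix: rows are legislators, columns are bills.
C = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1],
])

# "Ideology": a singular value decomposition of the cosponsorship matrix;
# the second left singular vector gives a one-dimensional score per legislator.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
print(U[:, 1])

# "Leadership": PageRank on a directed graph with an edge from each cosponsor
# to the (made-up) primary sponsor of each bill they cosponsor.
sponsors = [0, 1, 2, 3, 3]
G = nx.DiGraph()
for bill, sponsor in enumerate(sponsors):
    for legislator in range(C.shape[0]):
        if C[legislator, bill] and legislator != sponsor:
            G.add_edge(legislator, sponsor)
print(nx.pagerank(G))
```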

A list of fifteen books that make up Gardner’s “canon”

I know I can’t be the only person who’d like to see this: a list of the fifteen books that make up Martin Gardner’s body of Scientific American columns. I’ve been thinking for a while that I’d like the full set – I had a couple of the books when I was young and liked them quite a bit – but had been hampered by not being able to find the whole list.

The first four have been reissued by the MAA. (The site I linked to, martin-gardner.com, lists three, but The Unexpected Hanging and Other Mathematical Diversions, the fourth book, has been rereleased very recently as Knots and Borromean Rings, Rep-Tiles, and Eight Queens: Martin Gardner’s Unexpected Hanging, part of the MAA’s series of rereleases.) I don’t know if they intend to get through the whole set; Gardner updated them with some new information, so what will happen after his death?

Language Log on “specificity” and “sensitivity”

Language Log on “specificity” and “sensitivity” as (poorly chosen names for) properties of medical tests. Mark Liberman asks: why not just call them the true positive rate and true negative rate? The classic “what’s the probability that you have the disease, given that you tested positive?” problem is thrown in; you’ve seen it if you ever learned Bayes’ theorem.
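Worked through quickly, with made-up numbers: a test with 95% sensitivity (true positive rate) and 95% specificity (true negative rate) for a disease with 1% prevalence.

```python
sensitivity = 0.95   # P(test positive | disease)
specificity = 0.95   # P(test negative | no disease)
prevalence = 0.01    # P(disease)

# Bayes' theorem: P(disease | test positive)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
print(sensitivity * prevalence / p_positive)   # about 0.16
```

Even with a test this good, a positive result means only about a one-in-six chance of actually having the disease – which is the point of the classic problem.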

Explaining banding in a scatterplot of Goldbach’s function

David Radcliffe asks for an explanation of the “bands” in the scatterplot of the number of solutions to p + q = 2n in primes. To give an example, we have

2 × 14 = 28 = 23 + 5 = 17 + 11 = 11 + 17 = 5 + 23
2 × 15 = 30 = 23 + 7 = 19 + 11 = 17 + 13 = 13 + 17 = 11 + 19 = 7 + 23
2 × 16 = 32 = 29 + 3 = 19 + 13 = 13 + 19 = 3 + 29

and so, denoting this function by f, the elements of this sequence corresponding to n = 14, 15, 16 are f(14) = 4, f(15) = 6, and f(16) = 4 respectively. (Note that the counts here are of ordered sums, with 23 + 5 and 5 + 23 both counting; if you use unordered sums everything works out pretty much the same way, since every sum except those of the form p + p appears twice, and I’m going to be talking about ratios and inequalities anyway.)
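Here’s one way to compute f, counting ordered sums as above (a sketch using sympy’s primality test):

```python
from sympy import isprime

def f(n):
    """Number of ordered pairs of primes (p, q) with p + q = 2n."""
    return sum(1 for p in range(2, 2 * n - 1)
               if isprime(p) and isprime(2 * n - p))

print([f(n) for n in (14, 15, 16)])   # [4, 6, 4]
```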

My re-rendering of something similar to the original scatterplot is here:

[plot-1: scatterplot of f(n) against n]

and there are heuristic arguments that f(n) \approx n/(\log n)^2, so let’s divide by that to get a plot of the “normalized” number of solutions:

[plot-2: the same scatterplot, normalized by n/(\log n)^2]

There are definitely bands in these plots. Indeed the situation for n = 14, 15, 16 is typical: f(n) “tends to be” larger when n is divisible by 3 than when it isn’t. A handwaving justification for this is as follows: consider primes modulo 3. All primes (with the trivial exceptions of 2 and 3) are congruent to 1 or 5 modulo 6 – equivalently, to 1 or 2 modulo 3 – and by the prime number theorem for arithmetic progressions these are equally likely. (For some data on this, see Granville and Martin on prime races, which is a nice expository paper.) So if we add two primes p and q together, there are four equally likely cases:

  • p is of form 3n+1, q is of form 3n+1, p+q is of form 3n+2
  • p is of form 3n+1, q is of form 3n+2, p+q is of form 3n
  • p is of form 3n+2, q is of form 3n+1, p+q is of form 3n
  • p is of form 3n+2, q is of form 3n+2, p+q is of form 3n+1

So if we just add primes together, we get multiples of three fully half the time, and the remaining half of the results are evenly split between integers of forms 3n+2 and 3n+1.
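The counting behind this can be done by brute force over the nonzero residue classes – nothing about actual primes here, just the bookkeeping for a modulus p:

```python
from collections import Counter

def sum_class_counts(p):
    """Count ordered pairs of nonzero residues (a, b) mod p by the class of a + b."""
    return Counter((a + b) % p for a in range(1, p) for b in range(1, p))

print(sum_class_counts(3))   # Counter({0: 2, 1: 1, 2: 1}): multiples of 3 twice as often
print(sum_class_counts(5))   # class 0 comes up 4 times, each other class 3 times
```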

We can make the bands “go away” by plotting, instead of f(n), the function which is f(n)/2 when n is divisible by 3 and f(n) otherwise. Call this f_3(n). But there’s still some banding:

[plot-3: scatterplot of f_3(n)]

Naturally we look to the next prime, 5. A given prime is equally likely to be of the form 5n+1, 5n+2, 5n+3, or 5n+4; if we work through the combinations we can see that there are 4 ways to pair these up to get a multiple of 5, and 3 ways to get each of the forms 5n+1, 5n+2, 5n+3, 5n+4. So it seems natural to penalize multiples of 5 by multiplying their f(n) by 3/4; the banding is then even less strong, as you can see below.

[plot-3b: scatterplot with the additional 3/4 correction for multiples of 5]

The natural thing to do here is to just iterate over primes. For the prime p we get that there are p-1 ways to pair up the residue classes 1, 2, \ldots, (p-1) \pmod p to get the residue class 0 (i. e. multiples of p) and p-2 ways to get each of the classes 1, 2, \ldots, p-1. That is, multiples of 2p are more likely than nonmultiples to be sums of randomly chosen primes, by a factor of (p-1)/(p-2). (The prime 2 plays no role here, since p and q are both odd.) Correcting for this, let’s plot n against

f^*(n) = f(n) \times \left( \prod_{p \mid n, \, p > 2} {p-2 \over p-1} \right);

in this case you get the plot below. The lack of banding in this plot is basically the extended Goldbach conjecture.

[plot-4: scatterplot of f^*(n)]
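Putting the pieces together, here’s a sketch of how one might generate the corrected plot (again using sympy, plus matplotlib; note the correction product runs over the odd primes dividing n, since 2 contributes nothing):

```python
import matplotlib.pyplot as plt
from sympy import isprime, primefactors

def f(n):
    """Ordered prime pairs (p, q) with p + q = 2n, as above."""
    return sum(1 for p in range(2, 2 * n - 1)
               if isprime(p) and isprime(2 * n - p))

def f_star(n):
    """f(n) with the (p-2)/(p-1) correction for each odd prime p dividing n."""
    correction = 1.0
    for p in primefactors(n):
        if p > 2:
            correction *= (p - 2) / (p - 1)
    return f(n) * correction

ns = range(3, 1000)
plt.scatter(ns, [f_star(n) for n in ns], s=2)
plt.xlabel("n")
plt.ylabel("corrected number of prime pairs summing to 2n")
plt.show()
```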

Although I didn’t know this when I started writing, apparently this is known as Goldbach’s comet: see e. g. Richard Tobin or Ben Vitale or this MathOverflow post.

And although this is a number-theoretic problem, much of this is an exercise in statistical model fitting; I proceeded by making a plot, checking out the residuals compared to some model to see if there was a pattern, and fitting a new model which accounted for those residuals. However, in this case there was a strong theory backing me up, so this is, thankfully, not a pure data mining exercise.

When does fall really start?

This year the autumnal equinox – which marks the point when the sun crosses the celestial equator – falls on September 22 in the United States (10:29 PM Eastern time, and correspondingly earlier by the clock in the other US time zones). People seem to refer to this as the “first day of fall”.

But this is the astronomical definition. Meteorologists define summer as June through August, and fall as September through November (so it’s been fall for a while now). I noticed this this morning when my local news was talking about today as the “last full day of summer” in reference to the weather. But if summer is to be defined meteorologically, we can take a look at the climate normals for Georgia (thanks to Golden Gate Weather); I’m using Atlanta Hartsfield Airport here. The hottest days of the year are July 16 through 19, with an average high of 89.3 degrees, in agreement with NOAA’s warmest-day-of-the-year map. If we want the hottest possible three-month period, we should find two days three months apart which have the same normal high temperature: sliding a three-month window forward by a day swaps the temperature at its start for the temperature just past its end, so the window average is maximized when the two endpoints match. For Atlanta these are June 7 and September 7; this is the part of the year where the normal high temperature is above 85 degrees.
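Here’s a sketch of that calculation; the daily_highs array below is a made-up smooth curve standing in for the actual Atlanta normals, so the printed days are purely illustrative.

```python
import numpy as np

days = np.arange(365)
# Made-up normal daily highs peaking in mid-July; not the actual Atlanta data.
daily_highs = 72 + 17 * np.sin(2 * np.pi * (days - 110) / 365)

window = 91  # roughly three months
# Wrap around the year end so windows can span December through February.
extended = np.concatenate([daily_highs, daily_highs[:window]])
window_means = np.convolve(extended, np.ones(window) / window, mode="valid")[:365]

print("warmest three-month window starts on day", int(np.argmax(window_means)))
print("coldest three-month window starts on day", int(np.argmin(window_means)))
```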

We could find “winter” the same way; a three-month winter in Atlanta would be November 27 through February 27, when average highs are below 59.5 degrees. Fall and spring are intermediate between those. However, in snowy places I’d be a bit more inclined to define winter as “the time when it snows a lot”. You may recall that Atlanta is not a snowy place.

The benefit of thinking of things this way is that it captures the insight that seasons come “late” or “early” in certain places. Take for example a San Francisco summer: late September is actually the hottest time of year in San Francisco. Indeed the warmest three-month period in San Francisco is August 2 through November 1 (when normal highs are 66.7 degrees or higher), which corresponds with my intuition that July just isn’t summer there. The coldest three-month period is November 23 through February 23 (when normal highs are 61.1 degrees or lower). In other words, fall in San Francisco is a few weeks in November, and spring lasts nearly half the year. San Francisco, by the way, is less snowy than Atlanta.

For climate charts that are prettier than anything I could make on the fly, see WeatherSpark for Atlanta and San Francisco. These dates that I’ve given roughly correspond with the “cold season” and “warm season” they report, although not exactly because they don’t appear to have constrained the lengths of those periods.

How the 538 model works

Here’s an explanation by Nate Silver of how his Senate prediction model works. It’s 10,000 words and denser than the typical FiveThirtyEight post, but it’s food for thought if you’ve been curious about what’s going on under the hood of FiveThirtyEight’s flagship product.

Make sure to click through to the footnotes – lots of links to subsidiary analyses from the past that explicate some of the interesting tidbits Silver and co. have built up over time.

A data scientist is…

My wife sent me this tweet by David M. Wessel this morning. It’s a photograph of a presentation slide giving three definitions of data scientists:

“A data scientist is a statistician who lives in San Francisco.
Data science is statistics on a Mac.
A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”

At my last job I lived in San Francisco, used a Windows machine, and was called a “quantitative analyst”. Now I live in Atlanta, use a Mac, and am called a “data scientist”.

(Oh, yes. I forgot to mention that. In the turmoil of a cross-country move blogging fell by the wayside. I’m hoping to get back in the habit.)

My conclusion (n = 1) is that the “uses Mac” variable has a higher weight than the “lives in San Francisco” variable. This may actually be true; a lot of data scientists are using Unix tools and those in general integrate better with Macs.

A final question: where are these quotes originally from?

It looks like the Mac quote is from big data borat in August 2013.

The last quote (slightly rephrased) is probably due to Josh Wills in May 2012.

In a Quora answer from January 2014, Alon Amit attributes the San Francisco quote to Josh Wills, who says he was riffing on nivertech’s line that “‘Data Scientist’ is a Data Analyst who lives in California.” Most of the Google hits for this quote are from January through March of 2014, but I feel like I heard it earlier; can anyone find a better citation?

Change in blackjack odds

A recent blackjack rule change at a couple Vegas casinos, reported in Business Insider: a “natural” blackjack (that is, being dealt two cards that sum to 21) will now pay out at 6:5 odds instead of 3:2. For those not familiar with blackjack: in blackjack, an ace can count as 1 or 11, and 10, jack, queen, or king all count as 10. So to get 21 you have to be dealt one of the eight pairs

(A, 10), (A, J), (A, Q), (A, K), (10, A), (J, A), (Q, A), (K, A).

There are 169 possible (ordered) pairs of ranks – I’m ignoring the distinction between sampling with and without replacement, or equivalently working with a shoe with infinitely many decks – so the probability of being dealt a natural blackjack is 8 in 169. The payout on a bet of 1 goes from 1.5 to 1.2, so this raises the house edge by (0.3)(8/169) = 1.42%. Given the typically narrow house edge in blackjack, that’s quite a change – certainly more than I expected when I first heard about it, before I did the math.
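The arithmetic, under the same infinite-deck assumption:

```python
# Infinite-deck assumption: the two card ranks are independent and uniform over 13.
# A natural is an ace plus a ten-valued rank (10, J, Q, K), in either order.
p_natural = 2 * (1 / 13) * (4 / 13)   # = 8/169, about 0.047
payout_change = 1.5 - 1.2             # 3:2 pays 1.5 per unit bet, 6:5 pays 1.2
print(p_natural * payout_change)      # about 0.0142, i.e. 1.42% added to the house edge
```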

Real math hiding in the Onion?

From the Onion: Modern Science Still Only Able To Predict One Upcoming Tetris Block.

Foreknowledge of those shapes, she explained, could lead to a breakthrough phenomenon she described as “a perpetual Tetris” of unlimited duration.

“While this remains entirely hypothetical at this moment, there exists a theoretical point at which the elimination of bottom rows occurs with such speed and efficiency that there is always enough room at the top of the matrix to accommodate new pieces,” Edelman said.

This is, surprisingly, a question about random number generators. It turns out that if you get 70,000 consecutive Z or S pieces, then you’re guaranteed to lose – try it out with Heidi Burgiel’s Java applet or read the accompanying paper. Since the probability of such a run is not zero, this will almost surely happen in an infinite “idealized” Tetris game. (But, of course, Tetris doesn’t have a perfect random number generator; as the Wikipedia article points out, the generator that is actually used repeats its numbers with a small enough period that this almost certainly doesn’t happen.)
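For a sense of scale – a back-of-envelope calculation assuming pieces are independent and uniform over the seven tetrominoes, which real Tetris implementations don’t guarantee – here’s the chance that a given block of 70,000 consecutive pieces is all S’s and Z’s:

```python
from math import log10

run_length = 70_000
p_s_or_z = 2 / 7                      # assuming each piece is uniform over 7 shapes
print(run_length * log10(p_s_or_z))   # about -38,000: probability around 10**-38000

# Nonzero, though, so over an infinite idealized game such a run appears
# almost surely (apply Borel-Cantelli to disjoint blocks of 70,000 pieces).
```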

Are there any other examples of “real” math hiding in the Onion?