Income inequality, social mobility, and sample size

Matt O’Brien at the Washington Post’s Wonkblog has an infographic that contains the following information:

quintile of income distribution               first  second  third  fourth  fifth
% of college graduates from poor families        16      17     26      21     20
% of high school dropouts from rich families     16      35     30       5     14

This comes from a paper entitled Equality of opportunity: definitions, trends, and interventions by Richard V. Reeves and Isabel V. Sawhill. The second row is from their figure 10, the first from their figure 11. Rich and poor families are those in the top and bottom income quintiles; the table is looking at their children’s income at age 40.

The interpretation that O’Brien suggests is that “Even poor kids who do everything right don’t do much better than rich kids who do everything wrong. Advantages and disadvantages, in other words, tend to perpetuate themselves.”

And that is true, but there’s something interesting I can’t help but notice here – the distribution of incomes for high school dropouts from rich families appears to have two peaks. Have some of these “rich” kids gotten a leg up from their families while others haven’t? More likely, though, the sample size involved is just too small to support detailed claims like this. (And the 80th percentile is hardly rich.) A genuinely bimodal income distribution might be possible in a society with multiple castes that hardly overlap, but that’s not the situation in the US – we have a lot of income inequality, but there are smooth gradations between the different segments of the income distribution.
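To see how easily a small sample can manufacture a second peak, here’s a quick sketch. The paper doesn’t report its sample sizes in the part quoted above, so the n here is purely illustrative: draw n people uniformly across the five quintiles – a distribution with no peaks at all – and look at the resulting percentages.

```python
import random
from collections import Counter

def sample_quintile_shares(n, seed=None):
    """Draw n people uniformly at random across five quintiles and
    return the percentage landing in each; the true distribution is flat."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(5) for _ in range(n))
    return [100 * counts[q] / n for q in range(5)]
```

With n around 100, individual runs routinely put some quintiles ten points above 20% and others ten points below – enough to produce spurious “peaks” of the sort in the table.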

Polling and the wisdom of crowds

From The Fix at the Washington Post: Americans think the Republicans will win control of the Senate. See also the New York Times’ Upshot, which references this paper by David Rothschild and Justin Wolfers. In some sense, by asking me who I think is going to win an election, you’re getting at not just who I’m going to vote for but who I think my friends are going to vote for, based on my conversations with them. For example, if hypothetically I’m part of one party’s base but I know a lot of swing voters, I might think about who my swing-voting friends say they’re going to vote for and answer that that candidate will win.

Essentially you’re inviting me to construct an ad hoc estimator of how the election will turn out by observing my social network. My own voting behavior is a biased estimator of the final election result; explicitly asking me to think about what will happen encourages me to correct for that bias.
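Here’s a toy version of that estimator – not the actual Rothschild–Wolfers setup, and the electorate share and friends-per-respondent numbers are made up for illustration. Each respondent predicts the winner by taking the majority preference among a random sample of acquaintances:

```python
import random

def expectation_poll(true_share, n_respondents, friends_per_person, seed=None):
    """Fraction of respondents who predict candidate A will win, where each
    respondent's prediction is the majority preference among a random sample
    of acquaintances (each independently for A with probability true_share)."""
    rng = random.Random(seed)
    predicts_a = 0
    for _ in range(n_respondents):
        friends_for_a = sum(rng.random() < true_share
                            for _ in range(friends_per_person))
        if 2 * friends_for_a > friends_per_person:
            predicts_a += 1
    return predicts_a / n_respondents
```

Even when true_share is only a little above one half, most respondents will name A as the winner, because each “who will win” answer effectively aggregates a handful of ballots rather than one.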

Links for October 26

Mona Chalabi at FiveThirtyEight on queueing theory as applied to grocery stores.

Heuristics for estimating life expectancy, from Decision Science News.

Natalie Wolchover at Quanta Magazine, At the Far Ends of a New Universal Law, on the Tracy-Widom distribution from random matrix theory.

How to tell the temperature using crickets from Priceonomics. (Supposedly this eventually goes back to the Arrhenius equation but a quick Google only finds me unsupported claims of this fact. Google Scholar is a little better.)

Better Explained has an interactive guide to the Fourier transform and the law of sines. (I bet at least one of these is old, but I came across them this week.)

Colm Mulcahy has rounded up a bunch of Martin Gardner’s puzzles for the BBC, in honor of the 100th anniversary of Gardner’s birth, and also his top ten Scientific American columns. The CBC program The Nature of Things did a show on Martin Gardner in 1996.

I’m still working out what to think of this app that solves math problems by pointing your phone at them.

Tiny Data, Approximate Bayesian Computation and the Socks of Karl Broman applies Bayesian computation to doing the laundry.

A book on the foundations of data science (high-dimensional geometry, Markov chains, etc.) by John Hopcroft and Ravindran Kannan of Microsoft Research is available online.

Users of MathOverflow have compiled a list of obscure names in mathematics, i.e. theorems whose names don’t tell you what the theorem is about or who discovered it.

Michael Jordan is interviewed by IEEE Spectrum and comments on how that process was disillusioning.

When to buy airplane tickets

From Yahoo Travel: what day of the week to buy airplane tickets for the best deal. Short version: round trip domestic airfares average about $430 on weekends and about $500 on weekdays, so buy on the weekend. The Yahoo piece is, in turn, a condensation of this piece from the Wall Street Journal. The WSJ piece acknowledges that a portion of this is because price-insensitive business travelers buy on weekdays and price-sensitive leisure travelers buy on weekends.

(Are business travelers really price insensitive? Sure doesn’t seem like it where I work, and lots of places have policies that basically require the employee to book at the lowest price unless they jump through a whole bunch of bureaucratic hoops. Whereas if I’m an individual buying a ticket, I can pay a little more for the more favorable schedule without asking anyone. But I digress…)

Seems to me that the big elephant in the room is that business travelers travel on different routes than leisure travelers. And if I’m trying to buy plane tickets for myself and hoping to be able to time this purchase, I don’t care what prices I could get on other tickets being bought by other people on the same day as me.

We could reproduce the phenomenon these articles are showing as follows. Imagine an airline with two routes. Say that tickets on route A, a business-heavy route, cost 600 dollars regardless of the day, and tickets on route B, a leisure-heavy route, cost 300 dollars. On weekdays, two-thirds of tickets purchased are on A and the average price is 500; on weekends only half of tickets are on A and the average price is 450. This whole thing may be a less severe form of Simpson’s paradox – I’m saying it’s less severe because a true Simpson’s paradox would have it actually being more expensive to buy tickets on the weekend for any given route.
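A few lines of arithmetic make the mix effect concrete (the prices and route shares are the made-up ones from the example above):

```python
def average_fare(share_a, price_a=600, price_b=300):
    """Average ticket price when a fraction share_a of purchases are on
    the expensive business route A and the rest are on route B."""
    return share_a * price_a + (1 - share_a) * price_b

weekday_avg = average_fare(2 / 3)  # two-thirds of weekday purchases on A
weekend_avg = average_fare(1 / 2)  # half of weekend purchases on A
```

The weekday average comes out to $500 and the weekend average to $450, even though the price of every individual ticket is the same all week – the $50 “weekend discount” is entirely a change in which tickets are being bought.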

It’s not impossible that it’s actually cheaper to buy tickets for a given route on the weekend – but looking at simple averages won’t prove it.

Simulating a bet on a whole series from bets on individual games

From Mind Your Decisions, a puzzle about gambling:

Your friend wants to make an even-payoff bet on the outcome of the entire World Series. That is, he wants to make a $100 bet so that if his team is the champion he will win $100, and if his team loses he will lose all of his money.

The problem is he uses a bookie that takes bets only on individual games, and not the entire outcome. The bookie is, however, offering even-payout bets for each game and for any dollar amount.

How much should your friend bet on each game so that he can simulate an even-payout $100 bet on the outcome of the entire series?

For notational simplicity, I’m going to measure money in units of $100, so you start with 1. And for concreteness, let’s say you want to bet on the Giants against the Royals. (I used to live in San Francisco and have never been anywhere near Kansas City.) The goal is to put together a series of bets that will leave you with 2 if the Giants win and 0 if they lose.

The “probabilities” that I’m going to mention are probabilities computed as if all games are independent and equally likely to be won by either team; of course this is not true in reality. (The finance folks have a name for this; it’s been a while since I looked at any finance. What is it?)

The answer can be summarized as follows: To determine what to bet on the Giants in game n, before game n but after game n-1:

  • determine the probability that the Giants will win the series if they win game n; call this p^+;
  • determine the probability that the Giants will win the series if they lose game n; call this p^-;
  • bet p^+ - p^-.

Now, note that the winning probability before game n must be p = (p^+ + p^-)/2.

By following this strategy, if the Giants win your bankroll goes up by p^+ - p^-, and the probability of the Giants winning goes up by p^+ - p or (p^+ - p^-)/2; that is, the change in your bankroll is twice the change in probability. This is also true if the Giants lose. At the beginning your bankroll is 1 and the probability of a Giants win is 1/2, so your bankroll is always twice the win probability. In the end it’s 2 if the Giants win and 0 if they lose, simulating the desired bet.
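The strategy is easy to check by brute force. Here’s a sketch that computes the win probabilities recursively (under the same coin-flip assumption) and runs the betting rule over every possible best-of-seven series:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def win_prob(wins, losses, target=4):
    """Probability the Giants win the series from a given score,
    treating each game as an independent fair coin flip."""
    if wins == target:
        return 1.0
    if losses == target:
        return 0.0
    return 0.5 * (win_prob(wins + 1, losses) + win_prob(wins, losses + 1))

def final_bankroll(results):
    """Follow the betting rule (bet p^+ minus p^- before each game) over a
    sequence of game results (True = Giants win); bankroll in units of $100."""
    bankroll, wins, losses = 1.0, 0, 0
    for giants_win in results:
        bet = win_prob(wins + 1, losses) - win_prob(wins, losses + 1)
        if giants_win:
            bankroll += bet
            wins += 1
        else:
            bankroll -= bet
            losses += 1
        if wins == 4 or losses == 4:
            break
    return bankroll
```

Running final_bankroll over all 2^7 sequences of game results confirms that the bankroll ends at exactly 2 whenever the Giants take four games and at exactly 0 whenever they drop four – the desired even-payoff bet.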

On a related note, people’s guesses about how scores proceed in an NFL game are wrong.

Fund Samuel Hansen’s kickstarter

Hopefully this isn’t too little, too late: you should fund Samuel Hansen’s Kickstarter for Relatively Prime: Series 2, an excellent series of long-form “stories from the mathematical domain”. Samuel is the creative force behind such excellent science podcasts as Combinations and Permutations, Strongly Connected Components, Science Sparring Society, and (with Peter Rowlett) Math/Maths, and he did a series of Relatively Prime a couple of years ago, so you know it’ll be good.

And because I know you were wondering, there’s a site that can tell you the probability that a Kickstarter will be funded.

Margins of error on Atlanta-area traffic signs

Every work day, in the evening on the way home, I pass a sign on Georgia 400 a few miles north of I-285. On a good day it will read something like:

“I-285: 4-6 MIN / I-85: 11-13 MIN”

and on a bad day it’ll read something like

“I-285: 10-12 MIN / I-85: 32-34 MIN”

I figure this sign is somewhere in Sandy Springs, Georgia, although it may be in Roswell, the next city north; see a Google map. There are other, similar signs that I also pass on my commute, but this is the one I pay attention to.

But what’s interesting about these signs is that, no matter how long they claim the drive will take, the range is always two minutes wide. You’d expect the error on the estimate from my sign to I-285 to be smaller than the error on the estimate from my sign to I-85 – the drive to I-285 is a part of the drive to I-85, and it would be quite strange for errors on the segment from the sign to I-285 to be negatively correlated with errors on the segment from I-285 to I-85. Presumably these one-minute “errors” are purely cosmetic, there to remind people that the estimates are not always correct. I assume the system doing the estimation has some internal estimate of its margin of error, presumably calibrated on past estimates – why not just use that? Although perhaps that would be a level of sophistication beyond what people are used to handling. In weather forecasting, for example, we regularly see probabilities of precipitation, but not error bars around temperature forecasts.
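The intuition about the two segments follows from the usual variance formula for a sum. A small sketch – the standard deviations and correlation below are hypothetical, since whoever runs the signs doesn’t publish theirs:

```python
from math import sqrt

def total_sd(sd1, sd2, rho):
    """Standard deviation of the sum of two segment travel-time errors
    with standard deviations sd1, sd2 and correlation rho."""
    return sqrt(sd1 ** 2 + sd2 ** 2 + 2 * rho * sd1 * sd2)
```

With equal one-minute errors on the two segments and no correlation between them, the full trip’s error is sqrt(2), about 1.4 minutes – wider than either segment’s alone. The total error can only come out narrower than the first segment’s when rho < -sd2/(2*sd1), exactly the strongly negative correlation that it would be quite strange to see.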

Another, less mathematical, thing about these signs: where Georgia 400 crosses I-285, in the morning (when traffic is generally heading towards 400, not away from it) a sign often reads “I-285 SPEEDS : EAST 55+ MPH WEST 55+ MPH”. I suppose they don’t want to just come out and admit that people go at least 70 when there’s no traffic; the speed limit is 65 in good conditions.

Links for October 19

John Cook spoke at KeenCon on Bayesian statistics as a way to integrate intuition and data.

A fantasy sports wizard’s winning formula, from Brad Reagan at the Wall Street Journal. Via Hacker News.

From Dan Egan of Betterment, It’s About Time in the Market, Not Market Timing, an analysis of the distribution of returns based on investment period.

Sarah Fallon at Wired: This Man’s Simple System Could Transform American Medicine. The man is David Newman, professor of emergency medicine at Mount Sinai Hospital, and the system is based on the number needed to treat (i.e. how many patients do you have to treat to get one positive outcome?).
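For what it’s worth, the number needed to treat is just the reciprocal of the absolute risk reduction; a one-liner, with event rates invented for illustration:

```python
def number_needed_to_treat(control_event_rate, treated_event_rate):
    """NNT: how many patients must be treated, on average, to get one
    additional good outcome; the reciprocal of the absolute risk reduction."""
    return 1 / (control_event_rate - treated_event_rate)
```

If 20% of untreated patients and 15% of treated patients have the bad outcome, the NNT is 1/0.05 = 20: treat twenty people to spare one of them.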

A talk from Jake Porway of DataKind on using data for good, and Defending microfinance with data science by Everett Wechtler at Bayes Impact.

Jordan Ellenberg, author of How Not to Be Wrong, speaks to students in Education as Self-Fashioning at Stanford University on October 10, 2014.

Tom Siegfried on the top ten unsung geniuses of science and math, at Nautilus.

Michael Byrne at Motherboard on a model that has been predicting the course of the Ebola outbreak, following An IDEA for Short Term Outbreak Projection: Nearcasting Using the Basic Reproduction Number by David N. Fisman, Tanya S. Hauck, Ashleigh R. Tuite, and Amy L. Greer. Also in Ebola news, let’s do some math on Ebola before we start quarantining people, on the old question of false positive rates in medical tests.

Tyler L Hobbs on probability distributions for algorithmic artists and randomness in the composition of artwork.

The MAA has released some long-lost Martin Gardner footage.

State population and area

Ork Posters makes maps of cities with their neighborhoods. I have three: Boston, Philadelphia, San Francisco. I was looking at them recently and noticed one thing these three cities have in common: the more centrally located neighborhoods are physically smaller than the outlying neighborhoods.

Is this a general trend? It’s hard to tell because neighborhoods don’t “officially” exist. Perhaps we can do something similar looking at states.

The first question is: what’s the “central” location to use? Since states have historically been formed by expansion outward from the capital, i.e. Washington, DC, I plotted land area against the distance from the state’s largest city to DC. (This was the driving distance as given on Google Maps, except for Hawaii, where I used the flight distance from Honolulu.)


This confirms what you see on a map – states get bigger as you move away from the capital. There are basically two clusters:

– the “main” cluster stretching from Pennsylvania to California;
– the “northeastern” cluster, where states are smaller than you’d expect from their distance to DC.

The existence of this “northeastern” cluster suggests that it might have made more sense to use a point further north – Philadelphia or even New York – for these earlier states. These states were formed as colonies, before the United States had a capital or was even a thing – but New York and Philadelphia both had their turns as capital before Washington, DC existed.

As you probably could have guessed beforehand if you know anything about the United States, Texas, Alaska, and Hawaii are outliers.

More distant states also turn out to have lower population. (Not just lower population density.)


Incidentally, at the county level, Ed Stephan observed that some states have variation in county size and some don’t, and that much of this seems to be explainable by changes in population density. So the observation that more remote states tend to be larger may well hold because remoteness from DC is, in the United States, correlated with lower population density. That’s a fancy way of saying that as you go west there are fewer people, which, if you ignore the western coast, is true.

As for my original question: how would you answer it, given that there aren’t official definitions of neighborhoods?

Election forecasting and big data

Matt Yglesias writes on what he calls the hazy metaphysics of probability as it applies to election forecasting. For example, as of right now, Nate Silver’s forecasts give the Democrats (plus independents) a 42.2% chance of holding the majority in the Senate, while Sam Wang’s forecasts put the same number at 39%. As Yglesias points out, we’ll just never get enough data points to know which of these models is closer to the truth. (By the time we do, the political system will have changed underneath us.)

Yglesias’s Vox has a roundup of various forecasts as well. (Silver is the fox logo; Wang is the orange-and-black Princeton shield.) One thing that jumps out, for me, is that the Washington Post tends to give much more extreme win probabilities – that is, closer to 0 or 100 percent – than the other models. I suspect that the models with more extreme win probabilities generally have the same point estimate – in all these models, the point estimate is basically the average of poll results, although of course you can quibble about which polls to average and how to weight them and so on. The secret sauce of any of these models is how accurately the results today – nearly a month before the election – predict the results on Election Day. In his explanation of the models he uses, Silver claims that Wang has historically gotten this step wrong, and he critiqued Wang at Political Wire; here’s Wang’s response.
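One way to see where the extremity comes from: with the point estimate fixed, the win probability depends entirely on how much uncertainty a model attaches to the gap between today’s polls and Election Day. A sketch – the margins and standard deviations here are invented, not any model’s actual parameters:

```python
from statistics import NormalDist

def win_probability(poll_margin, forecast_sd):
    """P(final margin > 0) if the Election Day margin is normally
    distributed around today's polling average with sd forecast_sd."""
    return 1 - NormalDist(mu=poll_margin, sigma=forecast_sd).cdf(0)
```

A 2-point polling lead is about a 66% win probability if the model thinks polls can drift 5 points by Election Day, but about 84% if it thinks they can only drift 2 – same point estimate, very different headline number.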

This is one of those places where the end-of-theory folks are just wrong. In some sense election forecasting is “big data” – at least, Silver, Wang, and the like are trying to predict a big-data result (millions of votes), although the samples aren’t so large. We don’t get to run elections over and over again – or, to go bigger, we get one shot at getting this climate change thing right. See for example Big Data: the end of theory in healthcare?, by Ben Wanamaker and Devin Bean. And there’s a reason that in my big-data job I work with a bunch of people who were academic scientists in their former lives.