My wife sent me this tweet by David M. Wessel this morning. It’s a photograph of a presentation slide giving three definitions of data scientists:

“A data scientist is a statistician who lives in San Francisco.
Data science is statistics on a Mac.
A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”

Now, at my last job I lived in San Francisco, used a Windows machine, and was called a “quantitative analyst”. Now I live in Atlanta, use a Mac, and am called a “data scientist”.

(Oh, yes. I forgot to mention that. In the turmoil of a cross-country move blogging fell by the wayside. I’m hoping to get back in the habit.)

My conclusion (n = 1) is that the “uses Mac” variable has a higher weight than the “lives in San Francisco” variable. This may actually be true; a lot of data scientists are using Unix tools and those in general integrate better with Macs.

A final question: where are these quotes originally from?

It looks like the Mac quote is from Big Data Borat in August 2013.

The last quote (slightly rephrased) is probably due to Josh Wills in May 2012.

In a Quora answer from January 2014, Alon Amit attributes the San Francisco quote to Josh Wills, who says he was riffing on nivertech saying “‘Data Scientist’ is a Data Analyst who lives in California.” Most of the Google hits for this quote are from January through March of 2014, but I feel like I heard it earlier; can anyone find a better citation?

A recent blackjack rule change at a couple of Vegas casinos, reported in Business Insider: a “natural” blackjack (that is, being dealt two cards that sum to 21) will now pay out at 6:5 odds instead of 3:2. For those not familiar with blackjack: an ace can count as 1 or 11, and 10, jack, queen, or king all count as 10. So to get 21 you have to be dealt one of the eight ordered pairs

(A, 10), (A, J), (A, Q), (A, K), (10, A), (J, A), (Q, A), (K, A).

There are 13 × 13 = 169 possible ordered pairs of ranks (I’m ignoring the issue of sampling with or without replacement, or alternatively working with a shoe with infinitely many decks), so the probability of being dealt a natural blackjack is 8 in 169. The payout on a bet of 1 goes from 1.5 to 1.2, so this raises the house edge by (0.3)(8/169) = 1.42%. Given the typically narrow house edge in blackjack, that’s quite a change; it is certainly more than I expected when I first heard about it, before I did the math.
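The arithmetic above can be checked by brute force. This is only a sketch under the same simplification as the text (each card's rank is drawn independently and uniformly, as with an infinite shoe); the names `RANKS`, `TEN_VALUED`, and `is_natural` are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

# Assumption (same as the text): ranks are drawn independently and uniformly,
# i.e. an infinite shoe, so every ordered pair of ranks is equally likely.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
TEN_VALUED = {"10", "J", "Q", "K"}


def is_natural(first, second):
    """True if the two ranks are an ace plus a ten-valued card, in either order."""
    return (first == "A" and second in TEN_VALUED) or (
        second == "A" and first in TEN_VALUED
    )


pairs = list(product(RANKS, repeat=2))
naturals = [pair for pair in pairs if is_natural(*pair)]

p_natural = Fraction(len(naturals), len(pairs))

# A natural on a bet of 1 used to win 3/2 and now wins 6/5.
edge_shift = (Fraction(3, 2) - Fraction(6, 5)) * p_natural

print(f"{len(naturals)} naturals out of {len(pairs)} ordered pairs")
print(f"P(natural) = {p_natural} = {float(p_natural):.4f}")
print(f"increase in house edge = {float(edge_shift):.2%}")
```

Using `Fraction` keeps the 8/169 exact until the final conversion to a percentage.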

Foreknowledge of those shapes, she explained, could lead to a breakthrough phenomenon she described as “a perpetual Tetris” of unlimited duration.

“While this remains entirely hypothetical at this moment, there exists a theoretical point at which the elimination of bottom rows occurs with such speed and efficiency that there is always enough room at the top of the matrix to accommodate new pieces,” Edelman said.

This is, surprisingly, a question about random number generators. It turns out that if you get 70,000 consecutive Z or S pieces, then you’re guaranteed to lose – try it out with Heidi Burgiel’s Java applet or read the accompanying paper. Since that number is finite, such a run has nonzero probability, so it will almost surely happen at some point in an infinite “idealized” Tetris game. (But, of course, Tetris doesn’t have a perfect random number generator; as the Wikipedia article points out, the generator that is used repeats its numbers with a small enough period that this almost certainly doesn’t happen.)
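To get a feel for how small that nonzero probability is, here is a rough sketch. It takes the condition above at face value (each of the 70,000 pieces is an S or a Z) and assumes an idealized generator that draws the seven tetrominoes independently and uniformly; both the model and the variable names are assumptions for illustration, not something from the applet or the paper.

```python
import math

# Assumptions for this sketch: seven piece types, drawn independently and
# uniformly at random (the "idealized" generator), two of which are S and Z.
PIECES = 7
BAD_PIECES = 2
RUN_LENGTH = 70_000  # run length quoted in the text

p_bad = BAD_PIECES / PIECES

# (2/7)**70000 underflows a float, so work with the base-10 logarithm instead.
log10_p_run = RUN_LENGTH * math.log10(p_bad)

print(f"log10 of P(a given window of {RUN_LENGTH} pieces is all S/Z) = {log10_p_run:.0f}")
```

The exponent is enormous and negative, but the probability is still strictly positive, which is all the “almost surely” argument needs for an infinitely long game.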

Are there any other examples of “real” math hiding in the Onion?

A list of famous quotes about statistics. I actually used the Fisher quote, “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of”, in an e-mail to a colleague today; I believe I first saw this quote in John Cook’s blog.

It’s been a while. I blame the holidays and some Secret Big News.

Facebook’s data science team has an interesting post on the age difference between two people in a relationship. Fun fact: the average age difference in same-sex couples (of either sex) is much larger than that in opposite-sex couples. Why? I can think of two reasons:

(1) the size of the pool of potential mates is smaller for same-sex couples than for opposite-sex couples. Therefore individuals in same-sex couples have to compromise more on other dimensions, like age.

(2) the idea that both partners in a relationship should be of the same age is “conventional”, and people who are in same-sex relationships (an unconventional choice, if strictly for numerical reasons) are likely to make unconventional choices about other aspects of their relationships as well.

One possible way to find evidence for (1): is the difference between same-sex and opposite-sex couples larger in areas where there are fewer same-sex couples? If so, this is evidence for the “compromise” hypothesis – where there are fewer same-sex couples there ought to be more compromising along other dimensions. (Similarly, are the members of same-sex couples more different on other dimensions – such as educational status, race, religion, and so on – in areas with fewer same-sex couples?) It seems more difficult to find a way to test (2).

It’s the hundredth anniversary of the publication of the first crossword – check out today’s Google Doodle.

On a related note, crosswords are possible in English (or other natural languages) because a large enough proportion of the possible strings of letters are actual words. I learned this from chapter 18 of Information Theory, Inference, and Learning Algorithms by David MacKay (which you can read online). (Chapter 19 is about why to have sex, from an information-theoretic point of view.) And Dr. Fill is a crossword-solving program by Matthew Ginsberg which did not win the 2012 American Crossword Puzzle Tournament.

This made the rounds last week: Substantiating Fears of Grade Inflation, Dean Says Median Grade at Harvard College Is A-, Most Common Grade Is A, from the Harvard Crimson.

Now, I agree that an A-minus is probably too high here. (Although Jordan Ellenberg says we shouldn’t worry about grade inflation.)

But does it really matter that the most common grade is an A? Consider, say, a situation where there is a “triangular” distribution of grades: 5 A, 4 B, 3 C, 2 D, and 1 F. The most common grade is an A, but the median is a B (and the mean is 2.67 on a 4.0 scale, a B-minus). If there are more grade categories the same thing happens: with a continuous triangular distribution of this shape, the median grade is $1/\sqrt{2} \approx 0.71$ of the way up the scale, about midway between a B-minus and a B on the 4.0 scale usual in the US. The mean grade would be $2/3 \approx 0.67$ of the way up the scale. More generally, say grades are in the interval [0, 1]. If grades are beta-distributed with parameters $\beta > 1$ and 1, so that the density is $\beta x^{\beta-1}$ (my triangular idea is just the Beta(2, 1) distribution), then the modal grade will be 1 but the mean and median will be a good bit lower, $\beta/(\beta+1)$ and $2^{-1/\beta}$ respectively.

(I’m not claiming that grades are beta-distributed, but that’s not a bad model for something that’s often thought of as being roughly normally distributed but has to be contained within an interval.)
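Here is a small sanity check of those numbers, offered as a sketch rather than anything definitive. It assumes the usual 4.0-scale grade points for the fifteen-student example, and for the continuous case it assumes the density $\beta x^{\beta-1}$ on [0, 1], whose CDF is $x^\beta$; the helper names `beta_mean` and `beta_median` are invented here.

```python
import statistics

# Discrete example: 5 A, 4 B, 3 C, 2 D, 1 F, using 4.0-scale grade points.
grades = [4] * 5 + [3] * 4 + [2] * 3 + [1] * 2 + [0] * 1

mode_grade = statistics.mode(grades)
median_grade = statistics.median(grades)
mean_grade = statistics.mean(grades)
print(f"mode = {mode_grade}, median = {median_grade}, mean = {mean_grade:.2f}")


# Continuous case (assumed model): density b * x**(b - 1) on [0, 1], b > 1.
# The CDF is x**b, so the median m solves m**b = 1/2.
def beta_mean(b):
    return b / (b + 1)


def beta_median(b):
    return 2 ** (-1 / b)


for b in (2, 3, 5):
    print(f"b = {b}: mode = 1, mean = {beta_mean(b):.3f}, median = {beta_median(b):.3f}")
```

With `b = 2` this is the triangular case, and the mean and median come out at 2/3 and $1/\sqrt{2}$, both well below the mode of 1.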

Basically, modes don’t tell you much.