Why do statisticians answer silly questions that no one ever asks?

Via @TimHarford, an article from Significance magazine: Why do statisticians answer silly questions that no one ever asks?, by Matt Briggs. People want to predict the future, and p-values and classical hypothesis testing are not really meant for that. Bayesian methods do better.

(I apologize: the original version of this post had some HTML errors. I’m still getting used to the WordPress software.)

School shouldn’t be an arms race

Stephen Bainbridge, law professor at UCLA, asks an interesting question: should students profit off my classes (by selling their notes)? He writes about his strategy: “I’m going to buy some of these note sets and outlines being sold for my classes. I’ll go through them and find all the mistakes. And then I’ll write exam questions testing on those very same mistakes.”

I can see the appeal of this, from a purely mercenary point of view, but I tend to not worry so much about such things. If the students want to shoot themselves in the foot, let them. I’d rather spend my time helping the students who want to learn than trying to trip up the students who are just looking at school as a hoop to be jumped through.

(And yes, I realize that some people can’t take notes for reasons of disability. This isn’t about them.)

via Hacker News.

An ancient magic trick: il laberinto di Ghisi

There’s a Conjuring Arts Research Center somewhere in midtown Manhattan. This video features Bill Kalush, its director, talking about and showing some highlights of their collection of books on the early history of magic.

Of mathematical interest: the booklet features three two-page spreads with pictures of sixty saints each. The same sixty pictures are repeated three times, and each spread is divided into four groups of fifteen. You pick a saint, and in each of the three spreads you point out which of the four groups your saint is in. “Of course” this works on the following principle: the first pick narrows the number of possible pictures down to fifteen. The pictures are rearranged between spreads so that in the second round, four (or perhaps three) pictures from each first-round group appear in each second-round group; that narrows it down to four (or three). In the third round the remaining candidates all sit in different groups, so the picture itself is found.
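
Just to convince myself that such an arrangement is possible, here’s a toy version in Python. The labeling scheme is my own invention, not the book’s actual layout, but it satisfies the constraints above: four groups of fifteen in every round, and the three answers always single out one saint.

```python
# A made-up arrangement with the same structure as the trick (the book's
# actual layout is different). Label each of the 60 saints with a triple
# (g1, g2, g3) saying which group it sits in during each round. Using all
# 4^3 = 64 triples except the four "diagonal" ones (0,0,0), ..., (3,3,3)
# leaves exactly 60 saints.
from itertools import product

triples = [t for t in product(range(4), repeat=3) if len(set(t)) > 1]
assert len(triples) == 60

# Each round has four groups of exactly fifteen saints.
for rnd in range(3):
    for g in range(4):
        assert sum(1 for t in triples if t[rnd] == g) == 15

# After two answers, four (or three) candidates remain, as described above.
for g1, g2 in product(range(4), repeat=2):
    remaining = [t for t in triples if t[0] == g1 and t[1] == g2]
    assert len(remaining) in (3, 4)

# All 60 triples are distinct, so the third answer settles it.
assert len(set(triples)) == 60
print("every saint is identified by its three group numbers")
```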

More pictures here (text in Italian). There’s a simulation of the mind-reading trick by Mariano Tomatis, magician and author. Tomatis also refers to a facsimile of the whole (21-spread) book that he’s prepared, and describes the difficulty he had in constructing it so that it would still work – the book is four centuries old, and therefore difficult to read in some places, but of course there is an internal logic to it. He’s also written the books Numeri assassini. Come scoprire con la matematica tutti i misteri del crimine and La magia dei numeri. Come scoprire con la matematica tutti i segreti del paranormale, as well as other books about magic that sound less mathematical.

(via metafilter.)

Are we all descended from Confucius?

Mark Liberman at Language Log asks this question, spurred by a Chinese professor’s claim to be a 73rd-generation descendant of Confucius. His conclusion: well, yeah, but if anyone in China is descended from Confucius (and this is documented), probably everyone in China is. Given a long enough time this would be true with “China” replaced by “the world”, but it probably hasn’t been long enough.

Why are correlations between -1 and 1?

So everyone knows that correlation coefficients are between -1 and 1. The (Pearson) correlation coefficient of x_1, \ldots, x_n and y_1, \ldots, y_n is given by

r = \sum_{i=1}^n {1 \over n}\left( {x_i - \mu_x \over \sigma_x} \right) \left( {y_i - \mu_y \over \sigma_y} \right)

where \mu_x, \mu_y are the means of the x_i and y_i, and \sigma_x, \sigma_y are their (population) standard deviations. Alternatively, after some rearrangement this is

r = {{x_1 y_1 + \cdots + x_n y_n \over n} - \mu_x \mu_y \over \sigma_x \sigma_y}

which is more convenient for calculation, but in my opinion less convenient for understanding. The correlation coefficient will be positive when (x_i-\mu_x)/\sigma_x and (y_i-\mu_y)/\sigma_y usually have the same sign — meaning that larger than average values of x go with larger than average values of y — and negative when the signs tend to be mismatched.
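
If you want to see the two forms agree, here’s a quick check in plain Python. The data is made up; everything else is standard library.

```python
from math import fsum, sqrt

x = [1.0, 2.0, 4.0, 5.0, 8.0]   # arbitrary example data
y = [2.0, 3.0, 3.0, 6.0, 9.0]
n = len(x)

mu_x, mu_y = fsum(x) / n, fsum(y) / n
sigma_x = sqrt(fsum((xi - mu_x) ** 2 for xi in x) / n)   # population SDs
sigma_y = sqrt(fsum((yi - mu_y) ** 2 for yi in y) / n)

# First formula: average product of standardized values.
r1 = fsum((xi - mu_x) / sigma_x * ((yi - mu_y) / sigma_y)
          for xi, yi in zip(x, y)) / n

# Rearranged "computational" formula.
r2 = (fsum(xi * yi for xi, yi in zip(x, y)) / n - mu_x * mu_y) / (sigma_x * sigma_y)

print(r1, r2)                 # the same number, up to rounding
assert abs(r1 - r2) < 1e-12
```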

But why should this be between -1 and 1? That’s not at all obvious from just looking at the formula. From a very informal survey of the textbooks lying around my office, if a text defines random variables then it gives a proof in terms of them. For example, Pitman, Probability, p. 433, has the following proof (paraphrased): Say X and Y are random variables and X^* = (X-E(X))/SD(X), Y^* = (Y-E(Y))/SD(Y) are their standardizations. First define correlation for random variables as Corr(X,Y) = (E(XY)-E(X)E(Y))/(SD(X)SD(Y)). Simple properties of random variables give Corr(X,Y) = E(X^* Y^*). Then observe that E(X^{*2}) = E(Y^{*2}) = 1 and look at

0 \le E[(X^* - Y^*)^2] = E(X^{*2}) + E(Y^{*2}) - 2E(X^* Y^*) = 2 - 2E(X^* Y^*)

and rearrange to get that E(X^* Y^*) \le 1. Similarly looking at X^*+Y^* gives E(X^* Y^*) \ge -1. Finally, the correlation of a data set is just the correlation of the corresponding random variables.

This is all well and good if you’re introducing random variables. But one of the texts I’m teaching from this semester (Freedman, Pisani, and Purves, Statistics) doesn’t, and the other (Moore, McCabe, and Craig, Introduction to the Practice of Statistics) introduces the correlation for sets of bivariate data before it introduces random variables. These texts just baldly state that r is between -1 and 1 always — but of course some students ask why.

The inequality we’re talking about is an inequality involving sums of products: it’s really |Cov(X,Y)| \le SD(X) SD(Y). And that reminded me of the Cauchy-Schwarz inequality — but how to prove Cauchy-Schwarz for people who haven’t taken linear algebra? Wikipedia comes to the rescue. We only need the special case in \mathbb{R}^n, in which case Cauchy-Schwarz reduces to

\left( \sum_{i=1}^n u_i v_i \right)^2 \le \left( \sum_{i=1}^n u_i^2 \right) \left( \sum_{i=1}^n v_i^2 \right)

for any real numbers u_1, u_2, \ldots, u_n, v_1, v_2, \ldots, v_n. And the proof at Wikipedia is simple: look at the polynomial (in z)

(u_1 z + v_1)^2 + (u_2 z + v_2)^2 + \cdots + (u_n z + v_n)^2.

This is a quadratic. As a sum of squares of real numbers it’s nonnegative, so it has at most one real root. So its discriminant is nonpositive. But we can write it as

(u_1^2 + \cdots + u_n^2) z^2 + 2(u_1 v_1 + \cdots + u_n v_n) z + (v_1^2 + \cdots + v_n^2)

and so its discriminant is

4(u_1 v_1 + \cdots + u_n v_n)^2 - 4 (u_1^2 + \cdots + u_n^2) (v_1^2 + \cdots + v_n^2)

and this being nonpositive is exactly the form of Cauchy-Schwarz we needed.
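
If you’d like a sanity check, here’s the inequality (and the nonpositive discriminant) verified on some random vectors in Python:

```python
# Check Cauchy-Schwarz on arbitrary random vectors; any real data works.
import random

random.seed(1)
u = [random.gauss(0, 1) for _ in range(100)]
v = [random.gauss(0, 1) for _ in range(100)]

sum_uv = sum(ui * vi for ui, vi in zip(u, v))
sum_uu = sum(ui * ui for ui in u)
sum_vv = sum(vi * vi for vi in v)

assert sum_uv ** 2 <= sum_uu * sum_vv                # Cauchy-Schwarz
assert 4 * sum_uv ** 2 - 4 * sum_uu * sum_vv <= 0    # the discriminant above
```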

To show that this implies the correlation coefficient being in [-1, 1]: let’s say we have the data (x_1, y_1), \ldots, (x_n, y_n) and we’d like to compute the correlation between the x_i and the y_i. The correlation doesn’t change under linear transformations of the data, so let u_i = (x_i - \mu_x)/\sigma_x and v_i = (y_i - \mu_y)/\sigma_y be the standardized data. Then we want the correlation of (u_1, v_1), \ldots, (u_n, v_n). But this is just

{u_1 v_1 + \cdots + u_n v_n \over n}.

By Cauchy-Schwarz we know that

(u_1 v_1 + \cdots + u_n v_n)^2 \le (u_1^2 + \cdots + u_n^2) (v_1^2 + \cdots + v_n^2)

and the right-hand side is n^2, since (u_1^2 + \cdots + u_n^2)/n is the variance of the u_i, which is 1, and similarly for the other factor. Therefore

(u_1 v_1 + \cdots + u_n v_n)^2 \le n^2

and dividing through by n^2 gives that the square of the correlation is bounded above by 1, which is what we wanted.
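
The whole argument is easy to replay numerically. Here’s a sketch with simulated data: standardize, check that the u_i^2 sum to n, and the bound falls out.

```python
# Replaying the argument: standardize, check sum(u_i^2) = n, then |r| <= 1.
from math import fsum, sqrt
import random

def standardize(data):
    n = len(data)
    mu = fsum(data) / n
    sd = sqrt(fsum((d - mu) ** 2 for d in data) / n)   # population SD
    return [(d - mu) / sd for d in data]

random.seed(2)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.7 * xi + random.gauss(0, 1) for xi in x]        # correlated with x

u, v = standardize(x), standardize(y)
assert abs(fsum(ui * ui for ui in u) - n) < 1e-9       # sum of squares is n

r = fsum(ui * vi for ui, vi in zip(u, v)) / n
assert -1 <= r <= 1
print(r)   # some value strictly between -1 and 1
```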

So now I have something to tell my students other than “you need to know about random variables”, which is always nice. Not that it would kill them to know about random variables. But I’m finding that intro stat courses are full of these black boxes that some students will accept and some want to open.

James Grime’s Numberphile videos

James Grime has “Numberphile”, a series of videos about various numbers and how they relate to more serious bits of maths. He goes outside with a big piece of paper, a marker, and a cameraman, and films himself talking and writing on the paper. (But you don’t have to watch him write! Thanks to the magic of film editing, the numbers appear one by one.) This one is about 220 and 284, the “amicable numbers”: the proper divisors of 220 add up to 284, and vice versa. Over at Maths Gear they’re selling pairs of keyrings with these numbers inscribed on them.

James points out that although the (220, 284) pair was known to the ancients, and Euler had found thirty pairs by 1747, it wasn’t until 1866 that B. Nicolò I. Paganini (as far as I know, no relation to the violinist) discovered the pair (1184, 1210). I’m a little surprised by this; given the stories you hear of great feats of calculation in that era, how did this one slip by? It reminds me of the fact that supposedly people once thought 2^{11}-1 = 2047 was prime, even though it’s trivial to factor by trial division. At least, if you have Arabic numerals…
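
Both claims take only a few lines to verify; a naive divisor sum is plenty fast at this scale.

```python
# Verify the two amicable pairs, then factor 2^11 - 1 by trial division.
def sum_proper_divisors(n):
    return sum(d for d in range(1, n) if n % d == 0)

for a, b in [(220, 284), (1184, 1210)]:
    assert sum_proper_divisors(a) == b
    assert sum_proper_divisors(b) == a

m = 2 ** 11 - 1
d = next(d for d in range(2, m) if m % d == 0)   # smallest factor
print(m, "=", d, "*", m // d)                    # 2047 = 23 * 89
```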

The whole Numberphile channel is here (currently 19 videos, a couple of hours in total).

God Plays Dice… auf Deutsch!

Okay, not really. I don’t know German. I took “German for reading knowledge” one summer in grad school and I’ve forgotten almost all of it.

But I miss mathblogging. I worked at it for a few years, and then I started to become self-conscious about my mostly dormant blog at Blogspot. Who uses Blogspot any more? And Blogspot seems difficult to use for “serious” blogging — typing mathematics was difficult. But here on WordPress, it’s easy. Did you know that e^{i \pi} + 1 = 0? And that \prod_{i=0}^\infty (1+z^{2^i}) = {1 \over 1-z} for |z| < 1? Really, they hold.
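
Both check out numerically, for what it’s worth. In this sketch the product is truncated after enough factors that z^{2^i} is negligible, with z kept inside the unit disc.

```python
# Check e^{i*pi} + 1 = 0, then the partial products of prod (1 + z^(2^i)).
import cmath

assert abs(cmath.exp(1j * cmath.pi) + 1) < 1e-15

# Each factor (1 + z^(2^i)) lets the i-th binary digit of the exponent be
# 0 or 1, so the expanded product hits every power of z exactly once.
z = 0.5
partial = 1.0
for i in range(6):
    partial *= 1 + z ** (2 ** i)
assert abs(partial - 1 / (1 - z)) < 1e-12
```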

My old math blog was called God Plays Dice. It started out as a project to procrastinate in the summer after I passed my oral exams; as you can see from the posting history there, it had a couple of good years but then I lost interest. This blog, too, is called God Plays Dice. The reason for the new subdomain is that godplaysdice.wordpress.com is taken.

One major difference, I hope, is that while my old blog was a blog about mathematics, this will be the blog of a mathematician. Or, these days, perhaps a statistician — I’m teaching statistics at Berkeley, and spending a lot of time around people who think about data has likely influenced what goes on in my head. Four and a half years ago I was looking inward, looking towards research and my dissertation; now I’m on the other side of graduation, looking towards what the rest of life will bring. And as Twitter has taught us, links without lots of accompanying commentary are worthwhile; that’s something I shied away from in the old days, but I’m hoping to change that this time around.

Here’s to a fresh start, on the second day of the second week of the second month of the second year of the second decade of… um… the second millennium, if you start counting millennia at zero?