Why are correlations between -1 and 1?

So everyone knows that correlation coefficients are between $-1$ and $1$. The (Pearson) correlation coefficient of $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ is given by

$r = \sum_{i=1}^n {1 \over n}\left( {x_i - \mu_x \over \sigma_x} \right) \left( {y_i - \mu_y \over \sigma_y} \right)$

where $\mu_x, \mu_y$ are the means of the $x_i$ and $y_i$, and $\sigma_x, \sigma_y$ are their (population) standard deviations. Alternatively, after some rearrangement this is

$r = {{x_1 y_1 + \cdots + x_n y_n \over n} - \mu_x \mu_y \over \sigma_x \sigma_y}$

which is more convenient for calculation, but in my opinion less convenient for understanding. The correlation coefficient will be positive when $(x_i-\mu_x)/\sigma_x$ and $(y_i-\mu_y)/\sigma_y$ usually have the same sign, meaning that larger-than-average values of $x$ go with larger-than-average values of $y$, and negative when the signs tend to be mismatched.
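As a quick sanity check that the two formulas agree, here is a minimal sketch in plain Python. The helper names (`mean`, `pop_sd`, `r_standardized`, `r_shortcut`) and the data are made up for illustration and are not from any of the textbooks mentioned here.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pop_sd(values):
    # Population standard deviation: divide by n, not n - 1.
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def r_standardized(xs, ys):
    # First formula: the average product of the standardized values.
    mx, my = mean(xs), mean(ys)
    sx, sy = pop_sd(xs), pop_sd(ys)
    n = len(xs)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / n


def r_shortcut(xs, ys):
    # Second formula: mean of the products minus product of the means,
    # divided by the product of the standard deviations.
    mx, my = mean(xs), mean(ys)
    sx, sy = pop_sd(xs), pop_sd(ys)
    n = len(xs)
    return (sum(x * y for x, y in zip(xs, ys)) / n - mx * my) / (sx * sy)


# Hypothetical data, chosen only so the arithmetic is easy to follow by hand.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

print(r_standardized(xs, ys))
print(r_shortcut(xs, ys))
```

For this data both means are $3$, both variances are $2$, and the average product of deviations is $8/5$, so both functions should print a value of about $0.8$ (up to floating-point rounding).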

But why should this be between $-1$ and $1$? That’s not at all obvious from just looking at the formula. From a very informal survey of the textbooks lying around my office, if a text defines random variables then it gives a proof in terms of them. For example, Pitman, Probability, p. 433, has the following proof (paraphrased): Say $X$ and $Y$ are random variables and $X^* = (X-E(X))/SD(X)$, $Y^* = (Y-E(Y))/SD(Y)$ are their standardizations. First define correlation for random variables as $Corr(X,Y) = (E(XY)-E(X)E(Y))/(SD(X)SD(Y))$. Simple properties of random variables give $Corr(X,Y) = E(X^* Y^*)$. Then observe that $E(X^{*2}) = E(Y^{*2}) = 1$ and look at

$0 \le E(X^*-Y^*)^2 = 1+1-2E(X^*Y^*)$

and rearrange to get that $E(X^* Y^*) \le 1$. Similarly, looking at $X^*+Y^*$ gives $E(X^* Y^*) \ge -1$. Finally, the correlation of a data set is just the correlation of the corresponding random variables.
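Spelled out, with the square taken inside the expectation, the two expansions use only linearity of expectation and $E(X^{*2}) = E(Y^{*2}) = 1$:

$E[(X^* - Y^*)^2] = E(X^{*2}) - 2E(X^* Y^*) + E(Y^{*2}) = 2 - 2E(X^* Y^*) \ge 0$

$E[(X^* + Y^*)^2] = E(X^{*2}) + 2E(X^* Y^*) + E(Y^{*2}) = 2 + 2E(X^* Y^*) \ge 0$

The first line rearranges to $E(X^* Y^*) \le 1$ and the second to $E(X^* Y^*) \ge -1$.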

This is all well and good if you’re introducing random variables. But one of the texts I’m teaching from this semester (Freedman, Pisani, and Purves, Statistics) doesn’t, and the other (Moore, McCabe, and Craig, Introduction to the Practice of Statistics) introduces the correlation for sets of bivariate data before it introduces random variables. These texts just baldly state that $r$ is between $-1$ and $1$ always — but of course some students ask why.

The inequality we’re talking about is an inequality involving sums of products: it’s really $|Cov(X,Y)| \le SD(X) SD(Y)$. And that reminded me of the Cauchy-Schwarz inequality, but how to prove Cauchy-Schwarz for people who haven’t taken linear algebra? Wikipedia comes to the rescue. We only need the special case in $\mathbb{R}^n$, in which case Cauchy-Schwarz reduces to

$\left( \sum_{i=1}^n u_i v_i \right)^2 \le \left( \sum_{i=1}^n u_i^2 \right) \left( \sum_{i=1}^n v_i^2 \right)$

for any real numbers $u_1, u_2, \ldots, u_n, v_1, v_2, \ldots, v_n$. And the proof at Wikipedia is simple: look at the polynomial (in $z$)

$(u_1 z + v_1)^2 + (u_2 z + v_2)^2 + \cdots + (u_n z + v_n)^2.$

This is a quadratic (unless all the $u_i$ are zero, in which case the inequality is trivially true). As a sum of squares of real numbers it’s nonnegative for every real $z$, so it has at most one real root. So its discriminant is nonpositive. But we can write it as

$(u_1^2 + \cdots + u_n^2) z^2 + 2(u_1 v_1 + \cdots + u_n v_n) z + (v_1^2 + \cdots + v_n^2)$

and so its discriminant is

$4(u_1 v_1 + \cdots + u_n v_n)^2 - 4 (u_1^2 + \cdots + u_n^2) (v_1^2 + \cdots + v_n^2)$

and this being nonpositive is exactly the form of Cauchy-Schwarz we needed.
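To see the discriminant argument numerically, here is a small sketch in plain Python. The function names and the random test vectors are illustrative assumptions, and a check on random vectors is not a proof, just a way to watch the inequality hold.

```python
import random


def quadratic_coefficients(u, v):
    # Coefficients (a, b, c) of sum((u_i * z + v_i)**2) = a*z**2 + b*z + c.
    a = sum(ui * ui for ui in u)
    b = 2 * sum(ui * vi for ui, vi in zip(u, v))
    c = sum(vi * vi for vi in v)
    return a, b, c


def discriminant(u, v):
    a, b, c = quadratic_coefficients(u, v)
    return b * b - 4 * a * c


random.seed(0)  # arbitrary seed, only so reruns give the same vectors

# Largest discriminant seen over many random pairs of vectors in R^5.
worst = max(
    discriminant(
        [random.uniform(-1, 1) for _ in range(5)],
        [random.uniform(-1, 1) for _ in range(5)],
    )
    for _ in range(1000)
)
print(worst <= 1e-12)  # small tolerance for floating-point rounding

# Equality case: v is -2 times u, so every term vanishes at z = 2
# and the quadratic has a double root there.
print(discriminant([1, 2, 3], [-2, -4, -6]))  # 0
```

The last line shows when Cauchy-Schwarz is an equality: the discriminant is zero exactly when the quadratic touches zero, which (for nonzero $u$) happens when the $v_i$ are a common multiple of the $u_i$.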

To show that this implies that the correlation coefficient is in $[-1, 1]$: let’s say we have the data $(x_1, y_1), \ldots, (x_n, y_n)$ and we’d like to compute the correlation between the $x_i$ and the $y_i$. The correlation doesn’t change under linear transformations of the data with positive slope, and standardizing is such a transformation. So let $u_i$ be the standardizations of the $x_i$ and let $v_i$ be the standardizations of the $y_i$. Then we want the correlation of $(u_1, v_1), \ldots, (u_n, v_n)$. But this is just

${u_1 v_1 + \cdots + u_n v_n \over n}.$

By Cauchy-Schwarz we know that

$(u_1 v_1 + \cdots + u_n v_n)^2 \le (u_1^2 + \cdots + u_n^2) (v_1^2 + \cdots + v_n^2)$

and the right-hand side is $n^2$, since $(u_1^2 + \cdots + u_n^2)/n$ is the variance of the $u_i$ (their mean is $0$), which is $1$ because they are standardized, and similarly for the other factor. Therefore

$(u_1 v_1 + \cdots + u_n v_n)^2 \le n^2$

and dividing through by $n^2$ gives that the square of the correlation is bounded above by $1$, which is what we wanted.
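The whole chain can be traced on a small data set. This is a minimal sketch in plain Python; `standardize` is a hypothetical helper and the numbers are made up for illustration.

```python
import math


def standardize(values):
    # Subtract the mean, then divide by the population standard deviation.
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]


# Hypothetical data; any data with nonzero spread would do.
xs = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
ys = [10.0, 3.0, 8.0, 6.0, 1.0, 2.0]
n = len(xs)

u = standardize(xs)
v = standardize(ys)

sum_u2 = sum(ui * ui for ui in u)  # should be n, up to rounding
sum_v2 = sum(vi * vi for vi in v)  # should be n, up to rounding
dot = sum(ui * vi for ui, vi in zip(u, v))
r = dot / n  # correlation of the original data

print(sum_u2, sum_v2)
print(dot ** 2 <= sum_u2 * sum_v2 + 1e-9)  # Cauchy-Schwarz, with rounding slack
print(r, r ** 2 <= 1)
```

Each sum of squares comes out to $n$ (here $6$) because the standardized values have mean $0$ and variance $1$, which is the step that turns Cauchy-Schwarz into $r^2 \le 1$.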

So now I have something to tell my students other than “you need to know about random variables”, which is always nice. Not that it would kill them to know about random variables. But I’m finding that intro stat courses are full of these black boxes that some students will accept and some want to open.

6 thoughts on “Why are correlations between -1 and 1?”

David Glasser says:

February 9, 2012 at 8:27 pm

This could use a definition of “standardization” for those of us following along up to there but not statistics experts.

1. Michael Lugo says:
  
  February 9, 2012 at 8:37 pm
  
  Fair enough. To standardize a set of numbers $(x_1, x_2, \ldots, x_n)$ you just subtract their mean and then divide by their standard deviation — so you get numbers that indicate how far above or below the mean they are, in units of their standard deviation. Similarly for random variables.
  
C says:

November 21, 2013 at 6:18 pm

don’t like this proof by standardization first

Wimivo says:

June 15, 2014 at 9:36 am

There’s a bit of error in the notation from the Pitman proof. In particular, what we really want is

$0 \le E[(X^* - Y^*)^2] = 1 + 1 - 2E[X^* Y^*]$

The way it's written, it looks as though the entire expectation is squared (which gives a trivial and useless result), whereas we really want the expression within the expectation to be squared.

text messae recovery says:

September 15, 2014 at 3:27 am

Nice post. I used to be checking constantly this blog and I’m inspired!
Very useful information specially the closing part 🙂 I handle such info much.
I was looking for this particular info for a very lengthy
time. Thank you and good luck.

benoit says:

April 21, 2015 at 6:34 am

Thanks a lot from France for the part with random variables. I don’t know if the Cauchy-Schwarz inequality works in this case, but your demonstration is nice without it. Of course, as you wrote in this article, the case with n values of statistics without random variables works well with it.