Pythagoras goes linear

Let x_i and y_i be independent and uniform on [0, 1]. Let w_i be the smaller of the two, and let z_i be the larger. Let h_i = \sqrt{w_i^2 + z_i^2}. So (x_i, y_i) is a random point in the unit square, and h_i is its distance from the origin. We can predict this distance using linear regression. For example, in R, we can pick 10^4 such points and execute the code

x <- runif(10^4, 0, 1)   # first coordinate: 10^4 uniform draws on [0, 1]
y <- runif(10^4, 0, 1)   # second coordinate
w <- pmin(x, y)          # elementwise minimum: the shorter leg
z <- pmax(x, y)          # elementwise maximum: the longer leg
h <- sqrt(w^2 + z^2)     # true distance from the origin
lm(h ~ 0 + w + z)        # least-squares fit with no intercept

to fit a linear model of the form h = aw+bz. The least-squares model here is, for this particular simulation, h = 0.4278w + 0.9339z, with R^2 = 0.9995. In other words, the formula

h = 0.4278 \min(x,y) + 0.9339 \max(x,y)

appears to predict h as a linear function of \min(x,y) and \max(x,y) quite well, and so the hypotenuse of a right triangle is approximately 0.4278 times its shorter leg, plus 0.9339 times its longer leg. For a particularly famous special case, try x = 3, y = 4; then we predict the hypotenuse is 0.4278(3) + 0.9339(4) = 5.019, quite close to the true value of 5.
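The same check goes through for a few other famous right triangles; here is a quick sketch in R, reusing the simulated coefficients above:

legs <- rbind(c(3, 4), c(5, 12), c(8, 15))   # legs of some famous right triangles
pred <- 0.4278 * pmin(legs[, 1], legs[, 2]) + 0.9339 * pmax(legs[, 1], legs[, 2])
true <- sqrt(legs[, 1]^2 + legs[, 2]^2)
cbind(pred, true)                            # about 5.02, 13.35, 17.43 vs. 5, 13, 17

The 3-4-5 triangle fares best of the three; the other two predictions are off by between 2 and 3 percent.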

Andrew Gelman and Deborah Nolan, in Teaching Statistics: A Bag of Tricks, give a very similar example, with slightly different numerical parameters, and quip that “if Pythagoras knew about multiple regression, he might never have discovered his famous theorem” (p. 146). They fit a model that is allowed to have a nonzero constant term; I choose to fit a model with zero constant term. I think our anachronistic Pythagoras would have had the sense to observe that if we double x and y, we should double the hypotenuse as well.

The natural question, to me, is to determine the “true” constants. So what constants a and b give the linear function ax + by that best approximates \sqrt{x^2+y^2} when we restrict to 0 < x < y < 1? The reason for the triangular region is that we’re restricting to the case where x is the smaller of the two and y is the larger. To be consistent with our Pythagoras-as-linear-regressor model, we’ll make the approximation in the least-squares sense. So we want to minimize

f(a,b) = \int_0^1 \int_0^y \left( \sqrt{x^2+y^2} - (ax+by) \right)^2 \: dx \: dy

as a function of a and b. This is a calculus problem. Expand the integrand to get

\int_0^1 \int_0^y x^2+y^2+a^2 x^2 + b^2 y^2 + 2ab xy - 2ax \sqrt{x^2+y^2} - 2by \sqrt{x^2+y^2} \: dx \: dy.

The polynomials are easy to integrate; the square-root terms somewhat less so, if it’s been a while since you’ve done freshman calculus.
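The two antiderivatives doing the heavy lifting, each easy to check by differentiating, are

\int x \sqrt{x^2+y^2} \: dx = {1 \over 3} (x^2+y^2)^{3/2}

and

\int \sqrt{x^2+y^2} \: dx = {x \over 2} \sqrt{x^2+y^2} + {y^2 \over 2} \sinh^{-1} {x \over y},

evaluated from x = 0 to x = y; the second is where the \sinh^{-1} 1 below comes from. Putting all the pieces together, we get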

f(a,b) = {1 \over 12} a^2 + {1 \over 4} ab + {1 \over 4} b^2 + {1 \over 3} - C_1 a - C_2 b

where C_1 = (2\sqrt{2}-1)/6 and C_2 = (\sqrt{2}+\sinh^{-1} 1)/4. Differentiating, we get

{\partial \over \partial a} f(a,b) = {1 \over 6} a + {1 \over 4} b - C_1

and

{\partial \over \partial b} f(a,b) = {1 \over 4} a + {1 \over 2} b - C_2.

Set both of these equal to zero and solve to get
a = 24C_1 - 12C_2 = 5 \sqrt{2} - 4 - 3 \sinh^{-1} 1 \approx 0.4269

and

b = -12C_1 + 8C_2 = -2\sqrt{2} + 2 + 2 \sinh^{-1} 1 \approx 0.9343

which are tolerably close to the coefficients that came out of the regression. (Those coefficients had standard errors of 0.0009 and 0.0005 respectively.)
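If you want to check the arithmetic, the closed forms are easy to evaluate in R (asinh is base R’s inverse hyperbolic sine):

C1 <- (2*sqrt(2) - 1)/6                     # from the x*sqrt(x^2+y^2) integral
C2 <- (sqrt(2) + asinh(1))/4                # from the y*sqrt(x^2+y^2) integral
c(a = 24*C1 - 12*C2, b = -12*C1 + 8*C2)     # 0.4269471 0.9343200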

Of course our hypothetical Pythagoras couldn’t have done these integrals, and would not have liked that the answers turn out to be irrational. Perhaps he would have just said that the length of the hypotenuse of a right triangle was three-sevenths of the shorter leg (3/7 ≈ 0.4286), plus fourteen-fifteenths of the longer leg (14/15 ≈ 0.9333).

15 thoughts on “Pythagoras goes linear”

  1. Very nice. This reminds me of an old shortcut that I read while doing graphics work on a 386: sqrt( x^2 + y^2 ) is approximately 1/2 min(x,y) + max(x,y). Translated into your above equations, this is a = 1/2, b = 1.

    This is very easy to pull off when your x and y are integers or fixed-point numbers. I suppose you could pull off fixed-point versions of your coefficients, too.

    One advantage a = 1/2, b = 1 has over a = 0.4269, b = 0.9343, though, is that axis-aligned distances are exact in the former and underestimated in the latter.
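    Both formulas are scale-invariant, so a quick R sketch over the ratio of the legs is enough to compare worst-case relative errors:

    r <- seq(0.001, 1, by = 0.001)    # ratio shorter/longer leg; longer leg = 1
    relerr <- function(a, b) abs(a*r + b - sqrt(r^2 + 1)) / sqrt(r^2 + 1)
    max(relerr(1/2, 1))               # classic shortcut: about 11.8%, at r = 1/2
    max(relerr(0.4269, 0.9343))       # least-squares fit: about 6.5%, as r -> 0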

  2. I tried computing the exact value of the coefficient of determination. I might be wrong, but I get

    \frac{50-30 \sqrt{2}+18 \sinh^{-1}(1)-22 \sqrt{2} \sinh^{-1}(1)+7 \sinh^{-1}(1)^2}{8-4 \sqrt{2} \sinh^{-1}(1)-2 \sinh^{-1}(1)^2}

    This is approximately equal to 0.995607, which has one less 9 than reported in the original post.

  3. I tried a similar computation when you consider the hypotenuse as a function only of the longer side. Then you get r^2 of 90% (correlation is ~95%).

    As a function of only the shorter side, you get r^2 of about 55% (correlation is ~74%). How does that compare with results in a science other than physics?

    (I computed all of these values exactly, but they’re not pretty …)
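    For anyone who would rather simulate than integrate, continuing the R session from the post (plain squared correlations, i.e. allowing an intercept):

    cor(h, z)^2   # longer leg only: r^2 of about 0.90
    cor(h, w)^2   # shorter leg only: r^2 of about 0.55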
