Weekly links for June 10

Day 1 problems, Day 2 problems, and solutions (both days) for the 2012 USA Mathematical Olympiad.

Alejandro at Knewton briefly explains item response theory, a method for scoring exams. (Say two students get 9 out of 10 on an exam; one misses the easiest question, one the hardest. Which one is better at what the exam tests?)

Nate Silver, of fivethirtyeight, has launched the 2012 presidential election forecast.

Tim Gowers asks How should mathematics be taught to non-mathematicians? The post is motivated by certain proposed changes to secondary education in the UK, to introduce courses in “Uses of Mathematics”, but most of the post is devoted to suggesting the sort of questions that students in such courses would be able to answer, and you don’t need to know anything about the UK education system to appreciate these.

A graph in a glass: a machine that turns the distributions of fruits mentioned on Twitter into smoothies. (I’d prefer a pie chart made of actual pie.)

Distributional footprints of deceptive product reviews. Some companies soliciting people to write fake reviews of their products get too greedy, and this can be detected.

High school kids are assholes. (Not the actual title, which is “friendship networks and social status”.) In brief: “In every network, without exception, we find that there exists a ranking of participants, from low to high, such that almost all unreciprocated friendships consist of a lower-ranked individual claiming friendship with a higher-ranked one.” Perhaps I’d have more to say if the subject of this paper were emotionally neutral, but I’m not in the mood to dredge up painful memories.

Jordan Ellenberg’s review of Alexander Masters’ Simon: The Genius in My Basement. I mentioned this book back in March in a weekly links post (in which I also mentioned Jordan!).

I’m looking for a job, in the SF Bay Area. See my LinkedIn profile.

How do departments get their names?

Jean Joseph, at the AMS Grad Math Blog, asks why some departments are called “Department of Mathematics” and others are called “Department of Mathematical Sciences”. The obvious explanation is that the “Mathematical Sciences” ones are more applied, but that doesn’t necessarily hold.

“Department of Mathematics” is much more common; I get 9,090,000 Google hits for it, compared to 531,000 for “Department of Mathematical Sciences”, for a 17 to 1 ratio.

In my Googling, the first ten hits for “Department of Mathematics” are the departmental web pages of Berkeley, Stanford, Washington, Purdue, Penn State, Florida State, Chicago, Wisconsin, MIT, and UCLA.

The first ten hits for “Department of Mathematical Sciences” are the departmental web pages of Carnegie Mellon, Clemson, Montana, Delaware, Michigan Tech, New Jersey Institute of Technology, Cincinnati, Florida Atlantic, Montana (again), and Central Connecticut State.

I don’t know how to interpret this data; obviously the “Department of Mathematical Sciences” schools are less notable, but that makes sense simply because there are fewer of them. (Besides, I don’t want to be on record as insulting Carnegie Mellon, because someone I love is in Pittsburgh.)

Now, historically statistics departments tend to be more applied in their outlook than mathematics departments, so if Joseph’s idea is right, then perhaps we’d expect “Statistical Sciences” to be more common, relatively speaking.

For “Department of Statistics” I get 4,640,000 hits; the first ten are Berkeley, Stanford, Washington, Penn State, Texas A&M, Oxford, UCLA, Chicago, Purdue, and Michigan. For “Department of Statistical Sciences” I get 63,400 hits, for a 73 to 1 ratio. The hits here start with Cornell, University College London, Duke, Cape Town, Padua, Virginia Commonwealth (which is actually “Statistical Sciences and Operations Research”), VCU again (this time a listing of their faculty), VCU again (some sort of “handbook”), VCU again (the page of Paul Brooks), and a flyer about Padua’s department. Interestingly, Cornell can’t make up its mind what to call its department; the HTML title of their page is apparently “Department of Statistics” but the banner at the top of the page identifies them as “Department of Statistical Science”.

So if anything, math departments are more likely to add “science” to their name than stats departments. Why?

Santana’s no-hitter redux

Slate’s sports podcast, “Hang Up and Listen”, talked about Johan Santana’s June 1 no-hitter in their most recent episode; I mentioned it back on June 2. Starting at 47:47, they talk briefly about this post by tangotiger, who argues that of the 27 outs in the game, all but six were “routine” outs; he figures that given the distribution of batted balls, Santana should have given up about two hits. If, like me, you didn’t see the game, you can see video of all 27 outs at mlb.com. (The “blown call” that’s been mentioned in a lot of places came in the top of the sixth, with Carlos Beltran batting. It’s a line drive down the third base line that was ruled foul.)

But every no-hitter has some degree of luck. Consider the following model: the batter hits the ball. Depending on where he hits it, that sets the probability of heads of a certain (imaginary) coin, i.e. a Bernoulli random variable. Take this to be 0 for a strikeout, 1 for a home run, and somewhere in between for balls in play. (Of course you could go back a step and start with the pitcher pitching.) Then if that coin comes up heads, the ball is a hit, and if not it’s an out. For each inning, record the number of hits until getting three outs; nine innings make a ball game.

Then for each team in every ball game you get two numbers: the sum of those probabilities of heads, which you could call the “expected” number of hits, and the actual number of hits. On average they’ll be the same. And of course they’re highly correlated. But conditional on the actual number of hits being 0, which is well below the average, the sum of the probabilities of heads — the “expected” number of hits — will always be somewhere greater than 0. (Unless we’re talking about a 27-strikeout game, which happened once in the minors in 1952 and never in the majors.) This is just regression to the mean.
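
This is easy to see in a quick simulation. Here’s a Python sketch; the batted-ball mix and hit probabilities below are invented for illustration, not real data, but any reasonable choice shows the same effect:

```python
import random

random.seed(0)

def draw_p(u):
    """Invented batted-ball mix: map a uniform draw to a hit probability."""
    if u < 0.20:
        return 0.0   # strikeout: can never become a hit
    if u < 0.50:
        return 0.05  # routine ball in play
    if u < 0.85:
        return 0.30  # ordinary ball in play
    return 0.70      # hard-hit ball

def simulate_team_game(n_outs=27):
    """One team's half of a game: bat until n_outs outs are recorded.
    Returns (sum of hit probabilities, actual hits)."""
    outs = hits = 0
    expected = 0.0
    while outs < n_outs:
        p = draw_p(random.random())
        expected += p
        if random.random() < p:
            hits += 1
        else:
            outs += 1
    return expected, hits

games = [simulate_team_game() for _ in range(100000)]
no_hit_expectations = [e for e, h in games if h == 0]
print(len(no_hit_expectations) / len(games))  # no-hitters are rare
print(sum(no_hit_expectations) / len(no_hit_expectations))
# the conditional mean of "expected" hits stays well above 0
```

Conditioning on zero actual hits selects the lucky games, so the average “expected” number of hits among them remains strictly positive.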

With the right data set you could empirically determine the probability that any given batted ball goes for a hit, and for recent no-hitters (where that data is presumably available somewhere) compute how much the “average” amount of luck is. I don’t have that data, though. But some pitchers of no-hitters benefited more from luck than others, and this wouldn’t be a horrible way to quantify that.

Spelling and prime factorization

Ben Zimmer writes a column for the New York Times, “On Language”. His June 25, 2010 column was entitled Ghoti. It’s not about beards. That’s not a misspelling of “goatee”. Rather, it’s a misspelling of “fish” (the “gh” of “enough”, the “o” of “women”, and the “ti” of “action”) that’s traditionally attributed to George Bernard Shaw.

In this column we learn about the absurd respellings that Alexander Ellis, a mid-nineteenth-century spelling reformer, came up with. And he did some calculations. He thought “scissors” should be spelled “sizerz” (okay, that’s not bad, although how would you spell “sizers”, as in “people who size”?), but at least it’s not spelled “schiesourrhce” (“combining parts of SCHism, sIEve, aS, honOUr, myRRH and sacrifiCE.”).

And Ellis gave three different numbers for the number of possible spellings of “scissors”: 1745226, 58366440, and 81997920. In the interest of trying to guess where these came from, the first thing that comes to mind is finding the prime factorizations. Why? Well, say someone told us “there are twelve ways to spell cat“. We’d logically think that they’d come up with, say, three ways to spell the first sound of that word (say, “c”, “k”, and “ck”), two ways to spell the second sound (“a” and “ah”), and two ways to spell the third sound (“t” and “tt”), for a total of 3 \times 2 \times 2 = 12 spellings:

cat, catt, caht, cahtt, kat, katt, kaht, kahtt, ckat, ckatt, ckaht, ckahtt
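
That count is easy to verify by brute force (a quick Python sketch; the sound-spellings are of course just the made-up ones from the example):

```python
from itertools import product

first = ["c", "k", "ck"]   # hypothetical spellings of the first sound
second = ["a", "ah"]       # hypothetical spellings of the second sound
third = ["t", "tt"]        # hypothetical spellings of the third sound

spellings = ["".join(s) for s in product(first, second, third)]
print(len(spellings))  # 12
print(spellings[:4])   # ['cat', 'catt', 'caht', 'cahtt']
```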

Of course English doesn’t work that way — you can spell the first sound of “cat” as “ck” but not at the beginning of a word! Zimmer tells us that Ellis acknowledged this. But if you assume the calculation was done this way, then twelve is an easy number to get. But eleven and thirteen are less likely, being primes. The numbers obtained in this way should be products of relatively small numbers, and therefore shouldn’t have large prime factors. And indeed we get

1745226 = 2 \times 3^8 \times 7 \times 19, 58366440 = 2^3 \times 3^3 \times 5 \times 11 \times 17^3, 81997920 = 2^5 \times 3^6 \times 5 \times 19 \times 37

and these could conceivably be products of six relatively small numbers. For example:

1745226 = 9 \times 193914 = 9 \times 9 \times 21546 = 9 \times 9 \times 14 \times 1539
= 9 \times 9 \times 14 \times 9 \times 171 = 9 \times 9 \times 14 \times 9 \times 9 \times 19

58366440 = 20 \times 2918322 = 20 \times 18 \times 162129 = 20 \times 18 \times 17 \times 9537
= 20 \times 18 \times 17 \times 17 \times 561 = 20 \times 18 \times 17 \times 17 \times 17 \times 33

81997920 = 20 \times 4099896 = 20 \times 19 \times 215784 = 20 \times 19 \times 24 \times 8991
= 20 \times 19 \times 24 \times 27 \times 333 = 20 \times 19 \times 24 \times 27 \times 9 \times 37

Where did I get these from? Let’s consider how I went from 20 \times 18 \times 162129 to 20 \times 18 \times 17 \times 9537 in my decomposition of 58366440. I’ve already written 58366440 = 20 \times 18 \times 162129. I know I’m going to have to write 162129 as a product of four numbers, so they’re going to be near 162129^{1/4} \approx 20.07. It turns out that 162129/17 is an integer, namely 9537, and no factor of 162129 is closer to its fourth root than 17 is. (That is, 18, 19, 20, 21, 22, and 23 are not factors of 162129.) This is a greedy algorithm, and these aren’t optimal decompositions in the sense of having the smallest sum. For example, in the last one I could replace 24 and 9, which multiply to 216, with 18 and 12, which have the same product but a smaller sum. But there’s no reason to expect that Ellis’ products had this property anyway; some sounds can be spelled in more ways than others. In particular the last one of these is unlikely to be what Ellis came up with, because the word “scissors” has two of the same sound — so I’d expect two of the factors to be the same. But what do you want from a greedy algorithm?
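
The greedy procedure is easy to mechanize. Here’s a Python sketch (divisor search by trial division) that reproduces the decompositions above:

```python
def proper_divisors(n):
    """Divisors of n other than 1 and n itself, by trial division."""
    divs = set()
    d = 2
    while d * d <= n:
        if n % d == 0:
            divs.add(d)
            divs.add(n // d)
        d += 1
    divs.discard(n)
    return sorted(divs)

def greedy_decompose(n, k):
    """Write n as a product of k factors: repeatedly take the divisor
    closest to the k-th root of what remains, as described in the post."""
    factors = []
    for parts_left in range(k, 1, -1):
        target = n ** (1.0 / parts_left)
        f = min(proper_divisors(n), key=lambda d: abs(d - target))
        factors.append(f)
        n //= f
    factors.append(n)  # whatever is left is the final factor
    return factors

print(greedy_decompose(1745226, 6))   # [9, 9, 14, 9, 9, 19]
print(greedy_decompose(58366440, 6))  # [20, 18, 17, 17, 17, 33]
print(greedy_decompose(81997920, 6))  # [20, 19, 24, 27, 9, 37]
```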

By the way, it’s not terribly hard to write down rules for going from spelling to pronunciation that work reasonably well. It seems like the same should be true of the reverse.

Statwing: dead simple statistical analysis

Statwing is described in its crunchbase profile as “Web-based statistical analysis software that speaks in plain english instead of arcane stats jargon.” (Crunchbase, for those who don’t know, is a directory of technology companies.) The founders are a pair out of Stanford, Greg Laughlin (whose background is in the social sciences) and John Le (whose background is in CS), which seems like a good pairing for something like this.

They’re not live yet, but if you sign up at their website they’ll let you know when Statwing is ready.

(Disclaimer: I don’t know these people, and I haven’t seen the product, but this seems like a niche that could use filling.)

Urbanspoon’s imaginary geography

Urbanspoon is a restaurant booking service. Urbanspoon SF Bay Area has, at the top of the page, seven links to urbanspoon pages for other areas: Fresno, Los Angeles, New York, Orange County, Sacramento, San Diego, and Santa Barbara. Six other California areas — and New York.

My guess is that these are the cities that people who are sometimes in the Bay Area are most likely to be in when they’re not in the Bay Area. From urbanspoon’s point of view this makes more sense than using simple earthbound geography.

When I click on New York, I similarly see such a mixture. Five northeastern cities: Baltimore, Hartford, “North Jersey”, Philadelphia, and Providence. And two far away: Los Angeles and the SF Bay Area. (However, I am in California, and urbanspoon may be using this information. In particular once I start clicking on cities at random I end up seeing cities that I’ve clicked on before. I won’t give any of the data I gathered this way because it seems to be taking into account my history, not just the histories of others which is what I’m trying to mine.)

Go to urbanspoon.com/choose and click on your city. What other cities do you see listed? Does this feel right to you?

Weekly links for June 3

Lots of links this week! I’m not sure why it worked out that way.

Meena Boppana, a high school student who participated in RSI 2011, gives a talk on Top Ten Reasons Why I Love Math. (From TedXHunterCCS.)

Desmos, a free online graphing calculator.

Michael Sandel, Harvard professor and author of Justice: What’s the right thing to do?, asks in his new book What Money Can’t Buy: The moral limits of markets whether quantification is the first step to moral decay. Via getstats. I’d like to think he’s wrong; I haven’t read the book. You can also watch Sandel’s Justice lectures online.

Shai Simonson and Fernando Gouvea have an essay on how to read mathematics.

The University of Minnesota has a catalog of open access textbooks.

From fivethirtyeight.com: swing voters and elastic states.

For the last three weeks Andrew Gelman has been posting one question per day from the (28-question) final exam for his course in Design and Analysis of Sample Surveys. Here’s Question 1. Here’s Question 2, and the solution to Question 1. By editing the URL you can find a sequence of posts each of which contains Question N and the solution to Question N-1; so far he’s up to Question 23.

A paper I’m surprised I’d never seen before: Methods for Studying Coincidences by Diaconis and Mosteller. Via Samuel Arbesman at Wired.

Cosma Shalizi, guest-posting (?) at Crooked Timber: In Soviet Union, Optimization Problem Solves You.

Econ films is making short films about economics. Someone should do this for mathematics. (I would, if I knew how to make films.) Via Tim Harford’s twitter.

David MacKay: A reality check on renewables, featuring back-of-the-envelope calculations on the mathematics of renewable energy. From TedXWarwick. “I love renewables, but I also love arithmetic”; the arithmetic shows that substantial fractions of the UK would need to be covered in wind farms or solar panels to have serious effects. For lots more of this see his book Sustainable Energy – Without the Hot Air, which is available online.

The changing complexity of Congressional speech, from the Sunlight Foundation.

Better drinking through data, from Prior knowledge.

Mark Dominus asks at stack exchange Why does mathematical convention deal so ineptly with multisets?

David Spiegelhalter on World at One on the effect of drinking, and his comments; does a unit of alcohol (10 ml, about half a US “standard drink”) cost five minutes of life?

It’s possible to win the US presidential election with 11 votes.

Multi-armed bandit algorithms vs. A/B testing, from Hacker News.

Forecast Advisor will tell you how accurate various weather forecasters are for your area.

The Mets’ no-hitter drought

Johan Santana pitched a no-hitter for the New York Mets on Friday night, the first one in Mets history. The Mets have played 8,019 games since they came into existence in 1962.

Jim Pagels at Slate asks: how unlikely is it that the Mets would have suffered this? The conclusion, based on the fact that no-hitters occur every 1600 games or so: roughly one in a hundred. There are twenty teams that have been around for at least fifty years, so the chance that some team would have a fifty-year drought going is about twenty in one hundred, or one in five.
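
Treating each game as an independent chance of producing a no-hitter, at the quoted rate of one per 1600 games or so, the arithmetic is a one-liner (a sketch; the 1600-game rate is Pagels’ figure, and independence is a simplification):

```python
p_per_game = 1 / 1600   # rough rate at which a team throws a no-hitter
n_games = 8019          # Mets games played, 1962 through June 1, 2012

# Probability of going that whole stretch without a no-hitter:
p_drought = (1 - p_per_game) ** n_games
print(p_drought)        # about 0.0067: on the order of one in a hundred

# Chance that at least one of twenty fifty-year-old teams has such a
# drought going, assuming (crudely) independence between teams:
p_some_team = 1 - (1 - p_drought) ** 20
print(p_some_team)      # about 0.125, the same ballpark as the one-in-five figure
```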

The Mets’ drought becomes a bit more surprising if you take into account, as Craig Glaser did coincidentally last week, that the Mets have historically had good pitching. A better model than Glaser’s would treat each season separately, predicting the number of no-hitters the Mets should have expected in each season — or even separate out each pitcher within those seasons — but that would be real work, and I’m not sure if it would cause an appreciable improvement.

What proportion of months contain parts of six calendar weeks?

What proportion of months have that annoying property that, on an old-fashioned paper calendar, the 23rd and 30th, or 24th and 31st, have to be scrunched up into a single box? Or, on a computerized calendar, six rows are necessary? Or, if we don’t want to refer to a particular calendar format, that it contains parts of six (Sunday-to-Saturday) calendar weeks?

For example, consider a 30-day month that starts on a Saturday, like September 2012, which is the next example of this phenomenon:

Sun Mon Tue Wed Thu Fri Sat
                          1
  2   3   4   5   6   7   8
  9  10  11  12  13  14  15
 16  17  18  19  20  21  22
 23  24  25  26  27  28  29
 30

31-day months that start on Friday or Saturday also have this property. To identify this month we can quickly code up Zeller’s congruence for the day of the week:


zeller = function(m, d, y){
  if(m < 3){m = m + 12; y = y - 1;}  # Jan and Feb count as months 13 and 14 of the previous year
  K = y %% 100;      # year within the century
  J = (y - K)/100;   # century
  h = (d + floor(13*(m+1)/5) + K + floor(K/4) + floor(J/4) - 2*J) %% 7;
  h;
}

This returns 0 for Saturday, 1 for Sunday, …, 6 for Friday.
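
As a sanity check, here’s the same congruence translated to Python (purely so it can be compared against a standard date library):

```python
from datetime import date, timedelta

def zeller(m, d, y):
    """Zeller's congruence: 0 = Saturday, 1 = Sunday, ..., 6 = Friday.
    January and February count as months 13 and 14 of the previous year."""
    if m < 3:
        m += 12
        y -= 1
    K = y % 100            # year within the century
    J = (y - K) // 100     # century
    return (d + (13 * (m + 1)) // 5 + K + K // 4 + J // 4 - 2 * J) % 7

# Compare against the standard library over a few years of dates.
# datetime numbers weekdays Monday = 0 ... Sunday = 6,
# so Zeller's h corresponds to (h + 5) % 7.
day = date(2000, 1, 1)
while day < date(2005, 1, 1):
    assert (zeller(day.month, day.day, day.year) + 5) % 7 == day.weekday()
    day += timedelta(days=1)
print(zeller(9, 1, 2012))  # 0: September 1, 2012 was a Saturday
```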

Then put the lengths of the months of the year in a vector:

lengths = c(31,28,31,30,31,30,31,31,30,31,30,31);

(You may object “what about leap year!” — but that doesn’t concern us, as February, even in leap year, can never require six rows.)

The sixweeks function returns TRUE if a month contains parts of six (Sunday to Saturday) calendar weeks, and FALSE otherwise:


# first is zeller's day of the week for the 1st: 0 = Saturday, 6 = Friday
sixweeks = function(days, first){
  ((days == 30) && (first == 0)) ||
  ((days == 31) && (first == 0)) ||
  ((days == 31) && (first == 6))
}

Now the Gregorian calendar has a period of 400 years. So we just run over some 400-year period and run sixweeks on every month. The result is a vector counting, for each month of the year, how many instances of that month in the 400-year cycle contain parts of six calendar weeks.


counts = rep(0, 12);

for(y in 2000:2399){
for(m in 1:12){
first = zeller(m, 1, y);
days = lengths[m];
counts[m] = counts[m] + sixweeks(days, first);
}
}

The output is (cleaned up a bit, and with the month names inserted):

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
116 0 112 60 116 56 116 112 56 116 56 116

These numbers add up to 1032 (run sum(counts)). So in each 400-year period, 1032 months out of 4800, or exactly 21.5%, include parts of six calendar weeks. 116 of these months are January, 0 are February, 112 are March, and so on. (If you believe that a calendar week is Monday to Sunday, because you take the dictates of the ISO too seriously or because you’re European, it’s not hard to adapt the sixweeks function to that; instead of 1032 you get 1028.)

Could we have predicted this number without the need for computation? We can come pretty close. Seven out of every 12 months are 31-day months; of those, about two-sevenths should start on a Friday or a Saturday. Similarly, four out of every 12 months are 30-day months, and one-seventh of those should start on a Saturday. So the probability that a randomly chosen month contains parts of six calendar weeks ought to be quite close to
{7 \over 12} \times {2 \over 7} + {4 \over 12} \times {1 \over 7} = {18 \over 84} \approx 0.214
and indeed we come pretty close!  In fact this post is historically backwards. I did this calculation first and then went to the computer and wrote the code to check it.
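
For the record, the exact fractions (a quick check with Python’s fractions module):

```python
from fractions import Fraction

# the back-of-the-envelope estimate from the paragraph above
estimate = Fraction(7, 12) * Fraction(2, 7) + Fraction(4, 12) * Fraction(1, 7)
# the exact answer from the 400-year count
exact = Fraction(1032, 4800)

print(estimate, float(estimate))  # 3/14 ≈ 0.2143
print(exact, float(exact))        # 43/200 = 0.215
```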