Weekly links for May 13

Persi Diaconis, The Markov Chain Monte Carlo revolution. “To someone working in my part of the world, asking about applications of Markov chain Monte Carlo is a little like asking about applications of the quadratic formula.”

Depression, relatively speaking: you’re more likely to feel disabled by your depression if you believe that you suffer from severe symptoms relative to the rest of the population. (And we all know people are bad at estimating where they fall within distributions.) Link goes to Wall Street Journal; longer writeup at Neuroskeptic.

Ethan Fosse, sociology PhD candidate at Stanford, converted from Stata to R. There’s actually a whole web site devoted to converting to R, rconvert.com, mostly aimed at businesses that have a lot already built within the framework of proprietary technologies.

Republic of Mathematics has a roundup of links under the title “So you want to be a data scientist?”

Joshua Gans writes about his 11-year-old son’s experience taking Stanford’s online game theory course. (Did you know you can preview a few hours of the lectures?)

Laura McLay (Punk Rock OR) asked her stochastic processes students to find the size of a zombie population during an outbreak.

Rick Wicklin on methods of testing your answers in statistical programming.

George Hart builds a do-deck-ahedron. (Is he taking inspiration from his daughter Vi in starting to make videos?)

The 4-peg tower of Hanoi.

How do you know if someone is great at data analysis?

Every major’s terrible.

Samuel Arbesman had a book which skipped from page 182 to page 215, which got him thinking about the math of bookbinding.

John D. Cook asks how long it takes a knight, moving at random on a chessboard starting at a corner square, to return to its starting square. A couple days later he gives solutions.
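One way to check an answer to the knight question, sketched here rather than taken from Cook's posts: for a random walk on a connected undirected graph, the expected return time to a vertex is the sum of all vertex degrees divided by the degree of that vertex. The function names below are my own, and the sketch assumes the knight picks uniformly among its legal moves at each step.

```python
from fractions import Fraction

# The eight (row, column) offsets of a knight move.
MOVES = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]


def knight_degree(row, col, size=8):
    """Number of legal knight moves from (row, col) on a size-by-size board."""
    return sum(
        0 <= row + dr < size and 0 <= col + dc < size
        for dr, dc in MOVES
    )


def expected_return_time(row, col, size=8):
    """Expected number of moves before a uniformly random knight first returns.

    Uses the standard fact that the stationary distribution of a random walk
    on a connected undirected graph is proportional to vertex degree, so the
    expected return time is (sum of all degrees) / (degree of the start).
    """
    total_degree = sum(
        knight_degree(r, c, size) for r in range(size) for c in range(size)
    )
    return Fraction(total_degree, knight_degree(row, col, size))


print(expected_return_time(0, 0))  # corner square: 168
```

A corner has only two legal moves out of a total degree of 336, which is why the walk takes so long to come back.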

Courses using Alpaydin’s machine learning textbook

I’m taking the machine learning course being taught by Andrew Ng at Coursera. At times it’s a bit light on the theory for my tastes, which is understandable, so I’ve been looking to other sources. One that I’d come across previously that I ended up buying is Ethem Alpaydin’s Introduction to Machine Learning.

But Alpaydin’s book has its own problem: a relatively small number of exercises, and no data. So it seems useful to find more exercises and people who have written data-based exercises to go with the book. The obvious place to do this is to find courses that have been taught using this book, so I decided to compile a list of such courses. I make no claim that this list is anywhere near a complete list; it was compiled by half an hour of Googling. But if I was going to make such a list, it seemed good to make it available.

Incidentally, I think in general it would be good to have lists of web pages of “courses taught using book X” available, both for learners (who might want to see supplementary resources, get a sense of which sections of a book are more or less important, and so on) and for teachers (to see how others have organized their courses).

Here’s the list:

Alan Yuille, Introduction to machine learning, UCLA Stat 161/261, Spring 2010.
Ahmed Elgammal, Machine learning, Rutgers 198:536 (CS), spring 2007.
Joakim Nivre, Machine learning for NLP, Uppsala fall 2011.
Thorsten Joachims, Machine learning, Cornell CS 478, spring 2008.
Dan Lizotte, Introduction to machine learning, Reykjavik, spring 2007.
Alexander Partzin, Machine learning, Tel Aviv, fall 2007.
Berrin Yanikoglu, Machine learning, Sabanci University (Turkey) CS512, fall 2011.
Andrea Danyluk, Machine learning, Williams CS 374, spring 2011.
Shan-Hung Wu, Machine learning, National Tsing Hua University, spring 2012.
Zheng-Hua Tan, Machine learning, Aalborg (Denmark), spring 2011.
Kevin Murphy, Machine learning, British Columbia CS 340, fall 2006.
Jugal Kalita, Machine learning, University of Colorado at Colorado Springs CS 586, spring 2010.

Incidentally, a lot of courses in this area seem to recommend more than one text, because it’s a rapidly growing area. Others that seemed to be mentioned a lot in the same breath as Alpaydin are Bishop, Pattern Recognition and Machine Learning and Mitchell, Machine Learning.

(Thanks to Brent Yorgey for a correction.)

OKCupid on gay and straight

OKCupid has dug into their dataset and has looked at gay sex vs. straight sex. (Safe for work, unless charts aren’t safe for work.) It turns out that at least in their data, men and women are equally promiscuous.

Back in 2007, and in more mainstream data sets, the numbers were different. The numbers seemed to vary from population to population, but one thing was consistent: men reported having had twice as many female sexual partners as women reported having male sexual partners. The obvious explanation, of course, is that people lie about how many sexual partners they’ve had, and that men and women lie in different directions (men adjust their number upwards, women downwards).

But this doesn’t show up in the OKCupid data set: the median number of sexual partners for both straight men and straight women, in their data set, is six. This is also the median number of sexual partners for gay men, and for gay women – OKCupid actually points this out to make the point that gay people are no less or more promiscuous than straight people. If you object to comparing medians, they actually give the whole distribution curve; the distributions of number of sexual partners for OKCupid-using straight people and OKCupid-using gay people are substantially the same. (Not having the raw data, I can’t say if the difference is statistically significant, but who cares?)

Of course this only says something about the self-selecting pool of OKCupid users. But it seemed worth calling out.

A half-baked idea for modifying Scrabble scores

I’ve recently been listening to an excellent podcast on language from Bob Garfield and Mike Vuolo at Slate, called Lexicon Valley. You may remember that back in March I pointed out that my name is supervocalic, i.e. it contains each vowel exactly once; in an early episode they ask a similar question, to find celebrities (Charlie Daniels is one example) who have the same vowels in both names.

In March they did an episode about Scrabble, a game which I’ve taken a renewed interest in because my girlfriend is much better at it than I am. But a large part of this is simply that she knows more obscure words than I do. Stefan Fatsis is the author of the book Word Freak: Heartbreak, Triumph, Genius, and Obsession in the World of Competitive Scrabble Players and a competitive Scrabble player himself, and was interviewed for the Scrabble episode of Lexicon Valley. Apparently the reliance of Scrabble on obscure words is seen as something of a problem in competitive Scrabble as well. North American players use a different word list than the rest of the world, and the North American list is shorter; some players don’t want to move to the longer list because they feel it contains too many obscure words.

One idea that occurs to me, although I don’t know how one would implement it, would be to multiply the score that a word receives by some function of the frequency with which the word is used. (I wouldn’t use the frequency itself as the multiplier; then Scrabble would reduce to seeing who can play THE the most.) But this would make scoring much harder: you’d have to pause to use lookup tables after every word. Computers, however, can handle this. More importantly, it would make scoring much less transparent. This seems especially a flaw at the end of the game; against well-matched opponents, games can come down to the final few moves, and under the current rules I know exactly how many points my words will receive.

(And in case you’re wondering: if I had to name a baby I would lean towards first names that contain the vowels A, E, and I exactly once each, and no O or U.)

Stanford interview with Reviel Netz

Stanford’s in-house news site has an interesting interview with Reviel Netz on mathematical proofs as literature. Netz is also the author of the fascinating book The Archimedes Codex: How a Medieval Prayer Book Is Revealing the True Genius of Antiquity’s Greatest Scientist. (By way of explaining the subtitle: the earliest known manuscript of Archimedes is a palimpsest.)

Weekly links for May 6

Catherine Ulitsky’s paintings, including some like this one that are basically Delaunay triangulations of the positions of birds in a flock. (via Radiolab)

John Cook on Traveling Salesman art, based on a traveling salesman app, which is a companion to Bill Cook’s book In Pursuit of the Traveling Salesman. (I haven’t read it. I also don’t know if the two Cooks are related.)

Kevin Carey argues that everyone should learn statistics because everyone has to serve on juries.

Julian Champkin writes for Significance Magazine about the Data Journalism Handbook.

John Allen Paulos on screening the screening tests.

Brian Hayes: Statistical mechanics of magnet balls.

Shankar Vedantam, NPR: Most of us aren’t average – the usual about how many things follow power laws. (From the title I was hoping this would be “most of us are not average at everything”.)

Pete Casazza, A mathematician’s survival guide. (via the AMS grad student blog.)

What’s a number, by Tom Christiansen. (via John Cook)

John Kerl’s Tips for mathematical handwriting.

R tutorial videos

People who want to learn the very basics of R may find these videos made by some Berkeley grad students useful.

Devlin’s Coursera transitions course

Keith Devlin writes about MOOCs, or “massive open online courses”, such as those offered by udacity and coursera¹. In particular he’s going to be offering a five-week “math transitions” course in October, via Coursera. Devlin writes:

Such courses typically comprise a mix of some elementary mathematical logic, proof techniques, some set theory through to an analysis of relations and functions, with a bit of elementary number theory and introductory real analysis thrown in to provide examples.

I’m a bit skeptical about this, because the coursera platform involves automated grading. This is fine for courses where problems have numerical answers, or for courses where the assignments are to program and whether a program works can be validated by an automated system. But the transition course is in some way the course where students learn how to prove things; I almost want to say it’s fundamentally a writing course. This is a problem that one runs into even when teaching in-person courses, if the course is large enough that the grading is outsourced to an inexperienced graduate student or even an undergraduate, as is common in some places; sometimes the grader is simply not experienced enough to give really high-quality feedback. Of course in many situations the grader could give such feedback but doesn’t have time to do so, which is really a different issue. But it seems like it will only be worse in the online format. I’m sure Devlin is aware of this, though, and I’ll be interested to see what he and his TAs do.

1. Why is udacity a .com and coursera a .org? In both cases it looks like the company registered the “other” domain at the same time, so it’s not a question of availability of domains.

A forgotten pseudorandom permutation on 26 letters

I’m reasonably sure that long ago and totally by accident, I discovered a permutation of the alphabet a, b, …, z that somehow naturally arose from the order of letters on the QWERTY keyboard and had order 630. One such permutation would be (abcdefg)(hijklmnop)(qrstuvwxyz), which has cycles of length 7, 9, and 10 and therefore has order equal to the least common multiple of 7, 9, and 10, which is 630. But of course this doesn’t naturally arise from the keyboard. 630 is interesting here because it’s ~~the largest order of a permutation of 26 elements~~ fairly large for the order of a permutation of 26 elements; the maximum is twice this, 1260, as pointed out by several commenters.
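The claim about the maximum can be checked by brute force. The order of a permutation is the least common multiple of its cycle lengths, so the question is which lcms can be reached by positive integers summing to 26. Here is a minimal sketch of that check; the function name is my own and this is just one straightforward way to do it, not anything the commenters posted.

```python
from math import gcd


def achievable_orders(n):
    """Return the set of orders of permutations of n elements.

    orders[k] holds every lcm obtainable from cycle lengths summing to
    exactly k; fixed points count as cycles of length 1.
    """
    orders = [{1}] + [set() for _ in range(n)]
    for total in range(1, n + 1):
        for part in range(1, total + 1):
            for previous in orders[total - part]:
                # lcm(previous, part), written with gcd for older Pythons
                orders[total].add(previous * part // gcd(previous, part))
    return orders[n]


orders_26 = achievable_orders(26)
print(max(orders_26))    # 1260
print(630 in orders_26)  # True
```

The maximum comes from cycle lengths 4, 5, 7, and 9 plus one fixed point.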

I had thought that this permutation was the one that, in the two-line notation, is written

abcdefghijklmnopqrstuvwxyz
qwertyuiopasdfghjklzxcvbnm

which takes a to q, b to w, and so on. But I checked during an idle moment earlier today; rewriting this in the cycle notation gives

(aqjphioguxbwvcetzmdrk)(fyn)(ls)

which has cycles of length 21, 3, and 2 and therefore has order lcm(21, 3, 2) = 42. So what was I thinking of?

Answer, added Wednesday, May 2: instead of going horizontally, go vertically: the second line is qazwsxedcrfvtgbyhnujmikolp, which gives the 7-9-10 cycle type.
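For anyone who wants to check both readings of the keyboard without tracing cycles by hand, here is a short sketch. The function names are my own; the second line of the two-line notation is passed in as a 26-letter string.

```python
from math import gcd
from string import ascii_lowercase


def cycles(second_line):
    """Cycle decomposition of the permutation a -> second_line[0], b -> second_line[1], ..."""
    mapping = dict(zip(ascii_lowercase, second_line))
    seen = set()
    result = []
    for start in ascii_lowercase:
        if start in seen:
            continue
        cycle = []
        letter = start
        while letter not in seen:
            seen.add(letter)
            cycle.append(letter)
            letter = mapping[letter]
        result.append("".join(cycle))
    return result


def order(second_line):
    """Order of the permutation: the lcm of its cycle lengths."""
    result = 1
    for cycle in cycles(second_line):
        result = result * len(cycle) // gcd(result, len(cycle))
    return result


horizontal = "qwertyuiopasdfghjklzxcvbnm"  # reading the keyboard row by row
vertical = "qazwsxedcrfvtgbyhnujmikolp"    # reading the keyboard column by column

print(cycles(horizontal), order(horizontal))
# ['aqjphioguxbwvcetzmdrk', 'fyn', 'ls'] 42
print(cycles(vertical), order(vertical))
# ['aqhdwkfxob', 'czpylvi', 'esumtjrng'] 630
```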