Weekly links for May 20

How to win at Battleship, in advance of the movie of the same name. Link goes to the non-technical description at Slate; here’s the technical version, by Nick Berry.

How subway networks evolve, from Scientific American.

Given a sack of sugar, an unbalanced scale, and two 5-pound weights, measure exactly 10 pounds of sugar. (H/T Dave Richeson)

Should we get rid of the “=” sign in mathematics?, from Republic of Math.

A mathematical challenge to obesity: the New York Times’ Claudia Dreifus speaks with Carson Chow of Pitt, who has a blog.

A real-life example of a bimodal (or trimodal?) distribution

There are currently 174 books on my Amazon wishlist that I could order directly from Amazon. (My wishlist has a total of 195 books, but 21 are only available from other sellers.) Total price is approximately $3,549 (I rounded all prices to whole dollars), for a mean of approximately $20 per book.

But the median price of a book on my wishlist is (again to the nearest whole dollar) $16; the difference between the median and the mean is a hint that the distribution is skewed. And there are actually two peaks — one centered on $10 and one centered on $16-17. The distribution looks like this:

I’ve cut off the histogram at $100, which omits Mitchell’s Machine Learning at a list price of $168.16. Here’s a zoomed-in version omitting the 23 most expensive (all those over $30):

The two peaks are easy to explain: paperbacks and hardcovers, respectively. The long right tail is pretty much exclusively made up of technical books. I’d suspect that for those who read a lot but don’t buy technical books, the bimodality holds up but there’s a lot less skewness.

(If you look closely you might see a third peak, at around $60, but in a data set of this size I’m not sure that’s real.)
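None of this requires anything fancy; something like the Python below reproduces the summary statistics and the histogram, with a short made-up price list standing in for the actual 174 prices.

```python
import statistics
import matplotlib.pyplot as plt

# Made-up prices in dollars; the real wishlist has 174 of these.
prices = [9, 10, 10, 11, 12, 15, 16, 16, 17, 17, 18, 25, 35, 60, 168]

print(f"total:  ${sum(prices):,}")
print(f"mean:   ${statistics.mean(prices):.2f}")
print(f"median: ${statistics.median(prices):.2f}")  # median < mean: right skew

# Histogram in one-dollar bins, cut off at $100 as in the plot above.
plt.hist([p for p in prices if p <= 100], bins=range(0, 101))
plt.xlabel("price (dollars)")
plt.ylabel("number of books")
plt.show()
```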

This is a much less depressing example than my standard example of a bimodal distribution, salaries of first-year lawyers.

Sedgewick slides on “Algorithms for the Masses”

Robert Sedgewick has the slides for a talk, Algorithms for the Masses, on his web site.

My favorite slide is the one titled “O-notation considered harmful”: Sedgewick observes that it’s more useful to say that the running time of an algorithm is ~ aN^c (and to back this up with actual evidence from running the algorithm) than to have a theorem that it’s O(N^c) (based on a model of computation that may or may not hold in practice).
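Concretely, that tilde notation lends itself to a doubling experiment: if the running time is ~ aN^c, then doubling N should multiply the time by a factor of 2^c, so c can be read off from the ratio of successive timings. Here’s a minimal sketch in Python; the quadratic-time function is just a stand-in for whatever algorithm you’re measuring.

```python
import math
import time

def algorithm(n):
    """Stand-in workload; replace with the algorithm being measured."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total

prev = None
n = 250
while n <= 4000:
    start = time.perf_counter()
    algorithm(n)
    elapsed = time.perf_counter() - start
    if prev is not None:
        # if time ~ a * N^c, doubling N multiplies the time by 2^c,
        # so c is roughly log2 of the ratio of successive times
        print(f"N = {n:4d}: {elapsed:6.3f}s, estimated c = {math.log2(elapsed / prev):.2f}")
    prev = elapsed
    n *= 2
```

For the stand-in above the printed estimates should settle near c = 2; the point is that you read the exponent off from experiments rather than from a theorem.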

The serious point of the talk, though, is that everyone should learn some computer science, preferably in the context of intellectually interesting real-world applications. This is what Sedgewick is doing in his Princeton course and in his book with Kevin Wayne, Algorithms, 4th edition, which I confess I have not read. There’s a Coursera course, in two six-week parts, starting in August and November respectively. For a lot of the heavy-duty mathematics you can see Sedgewick’s book with Flajolet, Analytic Combinatorics, a favorite of mine from my grad-school days. There’s even a Coursera course for that as well: a 5-week part 1 in February 2013 and a 5-week part 2 in March 2013.

Weekly links for May 13

Persi Diaconis, The Markov Chain Monte Carlo revolution. “To someone working in my part of the world, asking about applications of Markov chain Monte Carlo is a little like asking about applications of the quadratic formula.”

Depression, relatively speaking: you’re more likely to feel disabled by your depression if you believe that you suffer from severe symptoms relative to the rest of the population. (And we all know people are bad at estimating where they fall within distributions.) Link goes to Wall Street Journal; longer writeup at Neuroskeptic.

Ethan Fosse, sociology PhD candidate at Stanford, converted from Stata to R. There’s actually a whole web site devoted to converting to R, rconvert.com, mostly aimed at businesses that have a lot already built within the framework of proprietary technologies.

Republic of Mathematics has a roundup of links under the title “So you want to be a data scientist?”

Joshua Gans writes about his 11-year-old son’s experience taking Stanford’s online game theory course. (Did you know you can preview a few hours of the lectures?)

Laura McLay (Punk Rock OR) asked her stochastic processes students to find the size of a zombie population during an outbreak.

Rick Wicklin on methods of testing your answers in statistical programming.

George Hart builds a do-deck-ahedron (video: http://www.youtube.com/watch?v=w6xIVJe5tc4). (Is he taking inspiration from his daughter Vi in starting to make videos?)

The 4-peg tower of Hanoi.

How do you know if someone is great at data analysis?

Every major’s terrible.

Samuel Arbesman had a book which skipped from page 182 to page 215, which got him thinking about the math of bookbinding.

John D. Cook asks how long it takes a knight, moving at random on a chessboard starting at a corner square, to return to its starting square. A couple of days later he gives solutions.
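The answer is pleasantly checkable by simulation. The sketch below is not Cook’s code, just a quick sanity check; the theoretical value follows from the standard fact that for a random walk on a graph, the expected return time to a vertex is the sum of all vertex degrees divided by the degree of the starting vertex, here 336/2 = 168.

```python
import random

# the eight knight moves
MOVES = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]

def return_time(start=(0, 0)):
    """Number of uniformly random knight moves needed to revisit the start."""
    x, y = start
    steps = 0
    while True:
        legal = [(x + dx, y + dy) for dx, dy in MOVES
                 if 0 <= x + dx < 8 and 0 <= y + dy < 8]
        x, y = random.choice(legal)
        steps += 1
        if (x, y) == start:
            return steps

trials = 20_000
average = sum(return_time() for _ in range(trials)) / trials
print(f"average return time over {trials} trials: {average:.1f}")  # near 168
```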

Courses using Alpaydin’s machine learning textbook

I’m taking the machine learning course being taught by Andrew Ng at Coursera. At times it’s a bit light on theory for my tastes, which is understandable, so I’ve been looking to other sources. One source that I’d come across previously, and ended up buying, is Ethem Alpaydin’s Introduction to Machine Learning.

But Alpaydin’s book has its own problem: a relatively small number of exercises, and no data. So it seems useful to find more exercises, and people who have written data-based exercises to go with the book. The obvious way to do this is to find courses that have been taught from the book, so I decided to compile a list of such courses. I make no claim that the list is anywhere near complete; it was compiled in half an hour of Googling. But since I was making it anyway, it seemed worth making it available.

Incidentally, I think in general it would be good to have lists of web pages of “courses taught using book X” available, both for learners (who might want to see supplementary resources, get a sense of which sections of a book are more or less important, and so on) and for teachers (to see how others have organized their courses).

Here’s the list:

Incidentally, a lot of courses in this area seem to recommend more than one text, because it’s a rapidly growing area. Others that seemed to be mentioned a lot in the same breath as Alpaydin are Bishop, Pattern Recognition and Machine Learning and Mitchell, Machine Learning.

(Thanks to Brent Yorgey for a correction.)

OKCupid on gay and straight

OKCupid has dug into their data set and looked at gay sex vs. straight sex. (Safe for work, unless charts aren’t safe for work.) It turns out that, at least in their data, men and women are equally promiscuous.

Back in 2007, and in more mainstream data sets, the numbers were different. The numbers seemed to vary from population to population, but one thing was consistent: men reported having had twice as many female sexual partners as women reported having male sexual partners. The obvious explanation, of course, is that people lie about how many sexual partners they’ve had, and that men and women lie in different directions (men adjust their number upwards, women downwards).

But this doesn’t show up in the OKCupid data set: the median number of sexual partners for both straight men and straight women, in their data set, is six. This is also the median number of sexual partners for gay men, and for gay women – OKCupid actually points this out to make the point that gay people are no less or more promiscuous than straight people. If you object to comparing medians, they actually give the whole distribution curve; the distributions of number of sexual partners for OKCupid-using straight people and OKCupid-using gay people are substantially the same. (Not having the raw data, I can’t say if the difference is statistically significant, but who cares?)
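With the raw data the check would be routine; for instance, a two-sample Kolmogorov-Smirnov test compares the whole distributions rather than just the medians. The sketch below uses invented counts, since OKCupid’s numbers aren’t available; nothing about it is specific to their data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented stand-ins for the two samples of reported partner counts.
straight = rng.geometric(p=1/7, size=5000)
gay = rng.geometric(p=1/7, size=5000)

print("medians:", np.median(straight), np.median(gay))

# KS statistic: the largest vertical gap between the two empirical CDFs.
# (On discrete data like counts the p-value is conservative, which is
# fine for a rough check.)
statistic, p_value = stats.ks_2samp(straight, gay)
print(f"KS statistic {statistic:.3f}, p-value {p_value:.3f}")
```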

Of course this only says something about the self-selecting pool of OKCupid users. But it seemed worth calling out.

A half-baked idea for modifying Scrabble scores

I’ve recently been listening to an excellent podcast on language from Bob Garfield and Mike Vuolo of Slate, called Lexicon Valley. You may remember that back in March I pointed out that my name is supervocalic, i.e. it contains each vowel exactly once; in an early episode they ask a similar question, looking for celebrities (Charlie Daniels is one example) who have the same vowels in both names.
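The property is easy to check mechanically. Here’s a quick sketch; pleasingly, the word “supervocalic” passes its own test.

```python
def vowel_counts(name):
    """How many times each vowel appears in a name, ignoring case."""
    s = name.lower()
    return {v: s.count(v) for v in "aeiou"}

def is_supervocalic(name):
    """True if the name contains each of a, e, i, o, u exactly once."""
    return all(count == 1 for count in vowel_counts(name).values())

print(is_supervocalic("supervocalic"))                     # True
# Charlie Daniels: the same vowels appear in both names (one each of a, e, i)
print(vowel_counts("Charlie") == vowel_counts("Daniels"))  # True
```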

In March they did an episode about Scrabble, a game which I’ve taken a renewed interest in because my girlfriend is much better at it than I am. But a large part of this is simply that she knows more obscure words than I do. Stefan Fatsis is the author of the book Word Freak: Heartbreak, Triumph, Genius, and Obsession in the World of Competitive Scrabble Players and a competitive Scrabble player himself, and was interviewed for the Scrabble episode of Lexicon Valley. Apparently the reliance of Scrabble on obscure words is seen as something of a problem in competitive Scrabble as well. North American players use a different word list than the rest of the world, and the North American list is shorter; some players don’t want to move to the longer list because they feel it contains too many obscure words.

One idea that occurs to me, although I don’t know how one would implement it, would be to modify the score that a word receives by some multiplier that is a function of the frequency with which the word is used. (I wouldn’t use the frequency of the word itself; then Scrabble would reduce to seeing who can play THE the most.) But this would make scoring much harder: you’d have to pause to consult lookup tables after every word. Computers can handle that, but more importantly it would make scoring much less transparent. This seems like an especially serious flaw at the end of the game; against opponents I’m well matched with, games can come down to the final few moves, where I know exactly how many points my words will receive.
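Still, here’s a sketch of what such a rule might look like; the frequency table and the particular multiplier function are invented for illustration, and choosing them well is exactly the hard part.

```python
import math

# standard Scrabble tile values
TILE_VALUES = {
    **dict.fromkeys("aeilnorstu", 1), **dict.fromkeys("dg", 2),
    **dict.fromkeys("bcmp", 3), **dict.fromkeys("fhvwy", 4),
    "k": 5, **dict.fromkeys("jx", 8), **dict.fromkeys("qz", 10),
}

# Hypothetical frequencies, in occurrences per million words of running text.
WORD_FREQ = {"the": 50000.0, "quixotic": 0.1, "qi": 0.05}

def raw_score(word):
    return sum(TILE_VALUES[c] for c in word.lower())

def adjusted_score(word):
    """Tile score times a bonus that grows with the word's frequency.

    Caveat from above: keying the bonus to raw word frequency rewards
    playing THE over and over, so a real rule would need something
    subtler, say frequency relative to other words of the same length.
    """
    freq = WORD_FREQ.get(word.lower(), 0.01)   # treat unknown words as rare
    multiplier = 1 + math.log10(1 + freq) / 5  # arbitrary, gentle scaling
    return round(raw_score(word) * multiplier)

for w in ["the", "quixotic", "qi"]:
    print(w, raw_score(w), "->", adjusted_score(w))
```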

(And in case you’re wondering: if I had to name a baby I would lean towards first names that contain the vowels A, E, and I exactly once each, and no O or U.)