Thursday, December 15, 2011

Patterns in randomness: the Bob Dylan edition

The human brain is very good — quite excellent, really — at finding patterns. We delight in puzzles that involve pattern recognition... consider word-search puzzles, the Where’s Waldo stuff, and the game Set. We’re also great at giving patterns amusing interpretations, as we do when we fancy that clouds look like ducks or castles — or when we claim to see images of Jesus in Irish hillsides, pieces of wood, paper towels, and store receipts. Remember the cheese sandwich with the Virgin Mary on it, which sold on eBay for $28,000 in 2004? Miraculous, indeed.

It’s with the knowledge that we find apparent patterns in randomness that I approach this puzzling aspect of the random play feature of my car stereo. I’ve stuck in a microSD card that has about 4000 songs on it. I’ve put it on random play. And it appears to be playing songs in random order.

But it sure seems to be playing a lot of Dylan.

Bob, not Thomas. I like Bob Dylan, of course; that’s why I have quite a bit of him on the microSD card. But, for instance, on one set of local errands, it played two Dylan songs, something else, another Dylan, two other songs, then another Dylan. Four out of seven? Seems a bit odd.

Now, I know that if you ask a typical person which sequence is more likely to come up in a lottery drawing, 1-2-3-4-5, or 57-12-31-46-9, he will say not only that the latter is more likely, but that if the former came up he’d be sure something was amiss. In fact, they’re equally likely, and are as likely as any other pre-determined five-number sequence, but the one that looks like a pattern is one we think can’t be random. Similarly, it’s certainly possible to randomly pick four Dylan songs out of seven — or even four in a row, for that matter. And if there’s a bug in the algorithm that the audio system uses, why would it opt for Dylan, and not, say, Eric Clapton or the Beatles, both of which I also have plenty of on the chip?

So I played around with some numbers. Let’s make some simplifying assumptions, just to test the general question. Assume I have 20 songs from each artist, and a total of 4000 songs (and, so, 200 artists). If I play seven songs, how likely is it that at least two will be by the same artist?

It’s easier to figure out how likely it is that there won’t be repetitions. The first song can be anything. The likelihood that the second will be of a different artist than the first is (4000-20)/3999, about 99.5%. The likelihood that the third will differ from both of those is (4000-40)/3998. Repeat that four more times and multiply the probabilities: there’s a 90.4% chance of seven different artists in seven songs... meaning that there’s about a 9.6% chance of at least one repetition. Probably more likely than we might think.
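For anyone who wants to check that product, here is a minimal Python sketch under the same simplifying assumptions (4000 songs, 20 per artist, seven songs played). The function name and parameters are made up for illustration, and it assumes the player picks uniformly and never replays a song within the run.

```python
def prob_all_different_artists(total_songs=4000, songs_per_artist=20, plays=7):
    """Chance that `plays` songs, drawn without replacement, are all by different artists.

    Assumes every artist has exactly `songs_per_artist` songs (a simplification).
    """
    p = 1.0
    for k in range(plays):
        # After k distinct artists have been played, k * songs_per_artist songs
        # are ruled out, and k songs have already left the pool.
        p *= (total_songs - k * songs_per_artist) / (total_songs - k)
    return p


if __name__ == "__main__":
    p_distinct = prob_all_different_artists()
    print(f"all different artists: {p_distinct:.1%}")
    print(f"at least one repeated artist: {1 - p_distinct:.1%}")
```

Changing `songs_per_artist` or `plays` shows how quickly the chance of a repeat grows on a longer drive.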

Let’s look at Dylan, specifically. I have about 120 of his songs on there (3% of the total; maybe I should delete some, but that’s a separate question). What are the chances of having no Dylan in seven songs? No Dylan for the first is 3880/4000, 97% (makes sense: 3% chance of Dylan in any one selection). Continuing, no Dylan, still, for the second is 3879/3999. Repeat five more times and multiply: 80.8% chance of no Dylan, so there’s a 19.2% chance of at least one Dylan song if we play seven.

What about the chances of at least two Bob Dylan songs... a repetition of Dylan? Well, we figured out no Dylan above. Let’s figure out exactly one, and then add them. For the first to be Dylan and none of the others, we have 120/4000 * 3880/3999 * 3879/3998 * 3878/3997 * 3877/3996 * 3876/3995 * 3875/3994. About 2.5%. It’s the same for one Dylan in any other position, since the numerators and denominators can be mixed about. So the chance of exactly one Dylan song out of seven is 2.5 * 7, or 17.5%. Add that to the chance of zero, 80.8 + 17.5 = 98.3%, so there’s a 1.7% chance of at least two Dylan songs in a mix of seven songs.

In other words, it’s about a one in five chance that I’ll hear at least one Bob Dylan song, but only about a one in sixty chance that I’ll hear at least two of them, every time I take a 20- or 30-minute ride. Throw in some confirmation bias, where I forget about the trips that had Clapton and the Beatles and Billy Joel and Carole King, but no Dylan, and I guess the system could be working the way it’s supposed to, though by these numbers a Dylan repeat on a single trip ought to be fairly rare.
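The Dylan figures can also be computed in one step with the hypergeometric formula, which counts the ways of choosing exactly k Dylan songs among the seven played. This is a minimal Python sketch, assuming the player picks uniformly and never replays a song within the run; the function name and defaults are illustrative and have nothing to do with how the car stereo actually works.

```python
from math import comb


def prob_exact_hits(k, artist_songs=120, total_songs=4000, plays=7):
    """Hypergeometric chance of exactly k songs by one artist in `plays` draws
    made without replacement from `total_songs`."""
    others = total_songs - artist_songs
    return comb(artist_songs, k) * comb(others, plays - k) / comb(total_songs, plays)


if __name__ == "__main__":
    p_zero = prob_exact_hits(0)
    p_one = prob_exact_hits(1)
    print(f"no Dylan:           {p_zero:.1%}")
    print(f"exactly one Dylan:  {p_one:.1%}")
    print(f"at least one Dylan: {1 - p_zero:.1%}")
    print(f"at least two Dylan: {1 - p_zero - p_one:.1%}")
```

Calling `prob_exact_hits(4)` gives the chance of the four-out-of-seven run from the errands trip under the same assumptions.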

But, damn, it plays a lot of Bob Dylan!

10 comments:

Call me Paul said...

I notice similar strings of play when I listen to my iPod. It seems to favour a specific artist for a while, then later it will favour a different artist. I suspect confirmation bias and the pattern-seeking nature of our brains play a big part. I also suspect the algorithm used to select the songs isn't truly random.

Barry Leiba said...

«I also suspect the algorithm used to select the songs isn't truly random.»

I suspected that, too, and I'm still not sure. But the thing is, good code to generate pseudo-random numbers (technically, none of it is truly random, because it's done algorithmically) is readily available, and isn't rocket science anyway. Pretty much every software development kit has one, and they're all decent. You'd kind of have to go out of your way to do it badly.

If what we're observing really is there (as opposed to just appearing to be there), it's more likely that it's intentional bias that's been added in, in a misguided attempt to try to make the music style shift less abruptly, or some such.

Nathaniel Borenstein said...

Barry -- You wrote "good code to generate pseudo-random numbers... is readily available." That certainly hasn't been my experience. I've been looking for a good random number generator for use in shell scripts for years.

If you know of anything better for that purpose than "jot" for OS X, I'd love to hear about it. Even jumping through hoops to produce a highly variable seed, I see it produce the same results remarkably often.

Barry Leiba said...

Mm, I don't know about shell scripts. I'm thinking of Java, C, C++, FORTRAN, Pascal, and the like. I use Rexx for scripting (try ooRexx.org), and that also has a decent one.

I'm surprised that using something derived from a timestamp value for a seed doesn't give you adequate pseudo-randomness. Hm.

Nathaniel Borenstein said...

I've certainly tried using a timestamp as a seed. I started with (csh syntax, sorry):

set seed=`date +%s`

and when that didn't work tried further randomizing it with:

set seed=`expr $seed % $$`

And I still get crappy randomness. And if you can get good randomness in a compiled language, how hard would it be to write a utility (like "jot") to make it available in a shell script?

Katharine said...

Reminds me of a favorite old game of math teachers, to predict that in a given classroom there will be at least two people with the same birthday. I no longer recall the proof, but apparently the critical number at which this is likely to work is 23 people (and, of course, most classes have more than 30 students nowadays, so the teacher's odds are excellent).
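For reference, the usual argument is the same trick as the no-repeated-artist product in the post: multiply the chances that each new person misses every birthday already taken. This is a minimal Python sketch, assuming 365 equally likely birthdays and ignoring leap years; the function names are illustrative.

```python
def prob_shared_birthday(people, days=365):
    """Chance that at least two of `people` share a birthday (uniform birthdays assumed)."""
    p_all_different = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1 - p_all_different


def smallest_group_for_even_odds(days=365):
    """Smallest group size where a shared birthday is at least 50% likely."""
    people = 1
    while prob_shared_birthday(people, days) < 0.5:
        people += 1
    return people


if __name__ == "__main__":
    print(smallest_group_for_even_odds())
    print(f"class of 30: {prob_shared_birthday(30):.1%}")
```

Under those assumptions the smallest group with better-than-even odds is 23, and a class of 30 comes out at roughly 70%.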

Barry Leiba said...

See my update on this. The frequency of Bob Dylan was due to more than any error in randomness.

Call me Paul said...

@Nathaniel Borenstein: what are you using to judge the "randomness" of the results your code is giving you? "My own brain," is probably not the best answer, given the discussion above.

Brent said...

Actually, knowing Nathaniel, it would probably be a good answer :-)

Nathaniel Borenstein said...

Yes, Brent's right, my brain is pretty darned random so I know it when I see it. :-)

Seriously, I don't really see a better way to judge, given how hard it is to define randomness. I suppose one could write a program that said "the odds against getting this sequence randomly were X to 1."