Friday, March 19, 2010


Once more, on anonymizing data

How many times do we have to hear (or read) this sort of thing before it sinks in? “Anonymized” data that’s aggregated in any significant way is no longer anonymous. We saw it in 2006, when AOL briefly released search data. We saw it in 2009, when researchers used information from Flickr to identify users in a Twitter network.

And, now, Netflix has rediscovered it, in their attempts to have researchers data-mine their customers’ rental data:

But it turned out that letting very smart computer scientists and statisticians dig through the video rental site’s data had one major, unforeseen drawback. A pair of researchers at the University of Texas showed that the supposedly anonymized data released for the contest, which included movie recommendations and choices made by hundreds of thousands of customers, could in fact be used to identify them. [PDF]

So let’s say it again, all together now: when a lot of data about you can be put together and connected, the fact that each individual item has been “anonymized” is not sufficient to protect your anonymity (or any other sense of your privacy).
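To make the "put together and connected" part concrete, here is a toy sketch of how that kind of linking can work. Everything in it is made up for illustration: the data, the names, the `link` function, and the `min_overlap` threshold are my assumptions, and it is not the method the Texas researchers used. It only shows the general idea that a handful of (item, date) pairs known from some public source can single out one record in a data set where names have been replaced by opaque IDs.

```python
# Hypothetical "anonymized" release: names replaced by opaque IDs,
# but each record still carries (title, date) pairs.
anonymized = {
    "user_0417": {("Brazil", "2005-03-02"), ("Heat", "2005-03-09"),
                  ("Gattaca", "2005-04-01")},
    "user_0932": {("Brazil", "2005-06-11"), ("Alien", "2005-06-12"),
                  ("Heat", "2005-07-30")},
}

# Hypothetical public information, e.g. reviews posted under a real name.
public = {
    "alice": {("Brazil", "2005-03-02"), ("Gattaca", "2005-04-01")},
    "bob": {("Alien", "2005-06-12"), ("Heat", "2005-07-30")},
}


def link(anonymized, public, min_overlap=2):
    """Map each public name to the anonymized ID it most plausibly matches.

    A name is linked only if exactly one anonymized record shares the
    highest number of (title, date) pairs with it, and that number is
    at least min_overlap (an arbitrary threshold chosen for this sketch).
    """
    matches = {}
    for name, known in public.items():
        # Count shared (title, date) pairs with every anonymized record.
        scores = {anon_id: len(known & records)
                  for anon_id, records in anonymized.items()}
        best_id = max(scores, key=scores.get)
        best = scores[best_id]
        ties = sum(1 for s in scores.values() if s == best)
        if best >= min_overlap and ties == 1:
            matches[name] = best_id
    return matches


matches = link(anonymized, public)
print(matches)
```

Neither data set contains a name next to an opaque ID, yet two matching entries per person are enough to join them in this toy case. Real data is noisier, but it is also much larger, and more data per person generally makes a match easier, not harder.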

A friend of mine figures that it doesn’t matter anyway: as soon as computers came around and we started networking them and using the results, “all bets were off” on the idea of privacy. And that’s largely true. Still, we should be doing what we can to rein at least some of it in.

On the other hand, Google has my search data, my photos, and my email directly, along with whatever they get indirectly through agreements with other services. They probably know more about me than I know myself.

