Staring At Empty Pages: “Anonymous” data is not

Thursday, April 09, 2009

“Anonymous” data is not

Some researchers from the University of Texas at Austin have done an interesting study (PDF here), which will be presented at next month’s IEEE Symposium on Security and Privacy. They took anonymized information about the Twitter social network. They took full (not anonymized) information about the Flickr network, knowing that there was some (but relatively little) overlap of users between the two networks. And they used the latter to analyze the former... and to identify many of the users in the “anonymized” Twitter network.

Here’s the paper’s abstract:

Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc.
We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social network graphs. To demonstrate its effectiveness on real world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate.
Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy “sybil” nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary’s auxiliary information is small.

Note that last paragraph, in particular: they did not use any information from the tweets themselves, nor from the Flickr photos... only data about who was connected to whom. And it worked even though there wasn’t terribly much overlap between the networks. A relatively small amount of cross-network information allowed them to find the common points and to identify many of the users.
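To make the idea concrete, here is a toy sketch of how a who-is-connected-to-whom graph alone can give identities away. It is not the researchers’ algorithm, only a much-simplified seed-and-propagate illustration that I made up for this purpose: the function name `propagate`, the sample graphs, and the thresholds are all hypothetical. It assumes the attacker already knows the identities of a couple of “seed” nodes and then repeatedly matches each named user to the anonymous node that shares the most already-identified neighbours.

```python
from collections import defaultdict


def propagate(aux, anon, seeds, min_score=2, min_margin=1):
    """Extend a seed mapping from named nodes to anonymous nodes.

    aux, anon: dicts mapping each node to the set of its neighbours
               (undirected graphs; aux has names, anon has opaque ids).
    seeds:     dict of named node -> anonymous node, known in advance.
    """
    mapping = dict(seeds)
    used = set(mapping.values())
    changed = True
    while changed:
        changed = False
        for user in aux:
            if user in mapping:
                continue
            # Count, for each unclaimed anonymous node, how many of this
            # user's already-identified neighbours are adjacent to it.
            scores = defaultdict(int)
            for neighbour in aux[user]:
                if neighbour in mapping:
                    for candidate in anon[mapping[neighbour]]:
                        if candidate not in used:
                            scores[candidate] += 1
            if not scores:
                continue
            ranked = sorted(scores.items(), key=lambda kv: -kv[1])
            best, best_score = ranked[0]
            runner_up = ranked[1][1] if len(ranked) > 1 else 0
            # Accept only a clear winner, to avoid guessing on ties.
            if best_score >= min_score and best_score - runner_up >= min_margin:
                mapping[user] = best
                used.add(best)
                changed = True
    return mapping


# Named graph (the attacker's auxiliary information).
aux = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave", "erin"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"carol", "dave"},
}

# "Anonymized" graph: names replaced by numbers, plus one user (6)
# who does not appear in the named graph at all.
anon = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4, 5},
    4: {2, 3, 5},
    5: {3, 4, 6},
    6: {5},
}

result = propagate(aux, anon, seeds={"alice": 1, "bob": 2})
print(result)
```

Starting from just two known identities, the sketch recovers the other three purely from the shape of the graph, and the extra user who exists only in the anonymized graph is simply left unmatched. A real attack has to cope with noise, missing edges, and millions of nodes, which is where the paper’s actual work lies.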

This is related to the incident a couple of years ago, when AOL released anonymized search data and people were identified from it.

The message is that aggregated information exposes us, and that simply removing some identifying information is not enough. Put another way, information that we aggregate is more identifying than we realize.

This is important because we often “remove personally identifiable information” without understanding how personally identifiable the result is. Courts have gone along with this, and sometimes have required that such “anonymized” information be made available. Companies make demographic information available all the time. The government obtained who-called-whom information from the telephone companies, information that’s exactly analogous to the social-network graphs that the University of Texas researchers had from Flickr. What else can they glean from such a rich oracle of social-network connections, when that’s added to other data that we previously considered “safe” from a privacy standpoint?

These researchers have shown us how careful we have to be, considering how easy it is to collect various pieces of information and put them together into a whole that’s far more than the sum of the parts.