Thursday, August 31, 2006


Privacy, technology, and research

A couple of recent news items particularly interested me, because both relate to privacy issues we've run into on projects I've worked on recently. The two aren't directly related to each other, but together they point out problems with technology and privacy. One item is from NPR, about a program in Philadelphia to use GPS tracking in taxicabs. The other is from the New York Times, about concerns over the use of AOL's search data for study and research.

The first, the GPS system in the taxi cabs, relates to a project I worked on a couple of years ago involving "context-based services". The idea is that "user context" is information about the user (actually, any entity, not just a "user", can have context) that changes over time, and that we can collect context information and use it to make services work better for the user. The user wins here: by allowing us to collect the information, the user gets better service.

The most obvious context information is the user's location, but we had other ideas that included items synthesized from other data. Is the user on the phone? How long since the user has used her computer? What's on the user's calendar right now? And so on.

For tracking location, we had a number of mechanisms, some of which worked in the office, some outside, and some of which gave us information at different granularities than others. Your "location" can be as coarse as "in the office" or "not in the office", as accurate as a GPS device can give, or somewhere in between, such as the ID of the cell tower your BlackBerry is talking to. For the last of those, I wrote a program that runs on the BlackBerry and sends information to our Context Server each time the tower ID changes.
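The shape of that program was simple. Here's a rough sketch in plain Java (not the original BlackBerry code; the device call and the Context Server URL are stand-ins I've made up for illustration): poll the tower ID, and report it only when it changes.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of a tower-ID reporter: poll the current cell tower ID and send
// an update to the Context Server only when it changes.
// readCurrentTowerId() and CONTEXT_SERVER_URL are placeholders, not the
// real device API or the real endpoint.
public class TowerReporter {
    private static final String CONTEXT_SERVER_URL =
        "http://contextserver.example.com/context/location";

    public static void main(String[] args) throws Exception {
        int lastTowerId = -1;
        while (true) {
            int towerId = readCurrentTowerId();   // device-specific call
            if (towerId != lastTowerId) {         // report only on change
                report("user123", towerId);
                lastTowerId = towerId;
            }
            Thread.sleep(30000);                  // poll every 30 seconds
        }
    }

    private static void report(String userId, int towerId) throws Exception {
        URL url = new URL(CONTEXT_SERVER_URL);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        String body = "user=" + userId + "&towerId=" + towerId;
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        conn.getResponseCode();                   // fire and forget
        conn.disconnect();
    }

    // Placeholder: on the real device this would come from the radio API.
    private static int readCurrentTowerId() {
        return 0;
    }
}
```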

The cabbies' situation is similar, though not the same. The service they get is GPS navigation, but if you listen to the NPR item you'll hear that many of them don't want that. And they're probably right that a smart driver, one who knows the roads and knows what to expect, will beat an automated system nearly all the time. But what they give up is privacy, and not just while they're driving fares around, but any time they're in the cab.

My BlackBerry program allowed the user to turn off the tracking, though it didn't have to allow that. Our Context Server had all sorts of access controls on the data. It also could be set up not to save raw data, but only to keep synthesized context information. For instance, if we wanted to keep "in/out of office" as a piece of information, but not maintain actual location... then as soon as the location was used to determine whether the user is in the office or not, the location itself would be discarded.
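To make that concrete, here's a toy sketch of the idea, not our actual Context Server code; the class names and the office bounding box are invented for the example. The raw coordinates are used once to compute the synthesized fact, and only that fact is stored.

```java
// Toy illustration of storing only synthesized context: the raw location
// is used once to compute an "in the office" flag and is never retained.
// The office bounding box and class names here are made up.
import java.util.HashMap;
import java.util.Map;

public class ContextStore {
    private final Map<String, Boolean> inOffice = new HashMap<>();

    // Rough bounding box around the office (invented coordinates).
    private static final double LAT_MIN = 41.020, LAT_MAX = 41.025;
    private static final double LON_MIN = -73.720, LON_MAX = -73.715;

    // Raw latitude/longitude comes in, the derived fact is stored,
    // and the coordinates go out of scope immediately.
    public void updateLocation(String userId, double lat, double lon) {
        boolean in = lat >= LAT_MIN && lat <= LAT_MAX
                  && lon >= LON_MIN && lon <= LON_MAX;
        inOffice.put(userId, in);      // only the boolean is retained
    }

    public Boolean isInOffice(String userId) {
        return inOffice.get(userId);   // null if we've never heard from them
    }
}
```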

The taxi drivers likely don't have those sorts of options and assurances. Yet even if they did, there's the question of whether they trust the system, the administrators, the bosses... the government. The problem here is that once this information can be collected, it's hard to ensure that it will not be misused. And even if you can, it's harder to convince people that you have.

And that brings us to the AOL search information. The well-meaning AOL employee (now "ex-employee") who released the information did so for the right reasons: the data would be a very useful thing to research, and, after all, he made sure the user identities were not attached to the data. The problem is that removing the identities did not truly anonymize the data.

To get maximal value out of the data, all searches performed by the same user had to be kept together, even after the user's identity had been removed — it's particularly useful to know that the same user who searched for "world cup" also searched for "Manchester" and "Beckham". The trouble with that is that if you searched for names, addresses, local businesses, hobbies, and so on, it would be easy to build a profile of you from the data. Sometimes that resulted in easy identification of the actual user. And that caused some outrage, and the removal of the data (and a sacking). Unfortunately, the horse had already left the barn when AOL closed the barn door.
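A tiny example of why the grouping matters — the log entries and IDs below are invented, not taken from the AOL data. The individual rows look harmless, but grouping them on the "anonymous" ID is exactly the step that turns them into a per-person profile.

```java
// Invented example: a "de-identified" search log still lets you build a
// profile per user simply by grouping queries on the anonymous ID.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProfileDemo {
    record LogEntry(String anonId, String query) {}

    public static void main(String[] args) {
        // Made-up log lines; the IDs are opaque, but the queries are not.
        List<LogEntry> log = List.of(
            new LogEntry("1807", "world cup"),
            new LogEntry("2231", "cheap flights"),
            new LogEntry("1807", "Manchester"),
            new LogEntry("1807", "Beckham"),
            new LogEntry("1807", "plumbers near maple street springfield"),
            new LogEntry("1807", "smith family reunion 2006"));

        // Group queries by the anonymous ID -- interests, a street, a town,
        // and possibly a surname all end up attached to one "user".
        Map<String, List<String>> profiles = new LinkedHashMap<>();
        for (LogEntry e : log) {
            profiles.computeIfAbsent(e.anonId(), k -> new ArrayList<>())
                    .add(e.query());
        }

        profiles.forEach((id, queries) ->
            System.out.println("user " + id + ": " + queries));
    }
}
```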

This relates to our need in our antispam work to have labelled sets of email messages — one set of known spam, and one set of known "good" mail. It's easy to collect spam, and we have plenty. Most people are even happy to give us theirs, once they understand that spam doesn't represent their browsing habits (the fact that you get pornographic spam doesn't mean we'll think you surf porn sites). Good mail is another story.

Few people are willing to send us good mail, and even when they are, anonymizing it without removing too much useful information is difficult. Some of that's been done with the publicly available "Enron corpus", but that corpus is limited, and of limited use: it's mail sent from Enron people to other Enron people, so it doesn't give us a representative sample of the legitimate mail that's floating around the Internet.

And it goes back to the same privacy questions. If you give us your email, what will we use it for? Will we remove identifying features from it? Will we remove features that can identify others who correspond with you? Will we correlate the various pieces of your mail, allowing you to be identified anyway? Will your employer use it against you? Will the government come looking for it? How safe is your information? How safe is your privacy?

Those last questions come up in many contexts now. Telephone call logs have long been used by the police, after they obtain a court order. But recently we've seen that the government can get them without going to a judge, and we've heard how easy it is for people to trick the phone companies into providing the information. Automatic toll-paying systems keep track of where you've been. Credit-card purchases are tracked.

There are many ways in which your privacy is exposed, and each advance in technology, each new feature to make your life easier, comes with a corresponding threat to your privacy. It's our challenge, in research and development, to create these technologies with security, integrity, and privacy in mind. There are ways to safeguard the information, and we have to make sure those safeguards are designed in from the start, and that they're protected against attack by fraudsters and against abuse of authority.
