A couple of years ago, I talked about some of the search terms that people have used when they’ve found these pages. In the comments, Donna was amazed at the information that’s available to the web sites you visit.
My response to Donna’s comment discussed some stuff that’s available, but there are other issues too. Now, via BoingBoing, we hear about a new web site that’s trying to raise awareness about mechanisms that web sites can use to see what other, unrelated web sites you’ve visited.
You had no idea, did you, that you could come to these pages, and I could find out whether you’ve also visited, say, http://www.nytimes.com/, or http://www.schneier.com/blog/, or even https://www.bankofamerica.com/ ?
If you have visited any of those, note that the visited links show up in a different colour than the others (go ahead: click one, and see; they’re all safe, and point to the real sites). That difference is the key. The browser keeps track of which links you’ve visited, and treats them differently.
Let’s back up for a moment, to the early 1990s. It seems like just a few years ago to me, but it’s now over 15 years ago. Web pages were simpler then, oriented much more to text than to graphics, and set up so that the viewer had control of their layout.
That soon changed, though. Every business wants to control the look of its own web site, and a number of changes in HTML, the web-page language, came about. A key change, here, was the introduction of Cascading Style Sheets (CSS). These style sheets do things like
- set up the sidebar over there to the left;
- control the line spacing, the margins, and the paragraph indentation;
- choose the various fonts and font sizes and styles used in the text; and
- decide how lists are shown — it’s the style sheet that chooses those icons for the list bullets.
But consider those list bullets: they’re images, and the style sheet specifies each one with a URL that the browser has to fetch from a web server.
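A minimal sketch of such a rule might look like the following. The selector and the image URL are placeholders made up for illustration, not the ones this site actually uses.

```css
/* Hypothetical rule: the selector and the image URL are placeholders. */
ul {
  /* To draw each bullet, the browser requests this image from the server. */
  list-style-image: url("http://example.com/images/bullet.png");
}
```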
Now, here’s the trick: the CSS can define a background image for any type of text (you might want a different background image for block quotes than for normal text, for instance). You can also define a background image for visited links. And, because CSS is so flexible, you can put a bunch of invisible links into your page and give each a name. The CSS can give each named link a unique background URL, used only when that link is visited. The browser fetches a background image only for the links it considers “visited”, so the requests that arrive at the web server reveal which of those links the browser at the other end has been to.
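To make the mechanism concrete, here is a rough sketch of what such a probing style sheet could look like. The link names (`probe-nyt`, `probe-schneier`), the `example.com` server, and the `/seen` path are all assumptions invented for this illustration; it shows the general idea rather than any particular site’s code, and it only works in browsers that honour background images on visited links.

```css
/*
 * Assumed markup elsewhere in the page (the ids are made up), styled by
 * other rules so that the reader does not notice the links:
 *   <a id="probe-nyt" href="http://www.nytimes.com/">.</a>
 *   <a id="probe-schneier" href="http://www.schneier.com/blog/">.</a>
 */

/* Each named link gets its own background URL, applied only when visited. */
a#probe-nyt:visited {
  /* A request for this URL tells the server the first link was visited. */
  background-image: url("http://example.com/seen?site=nyt");
}

a#probe-schneier:visited {
  /* Likewise for the second link; no request means "not visited". */
  background-image: url("http://example.com/seen?site=schneier");
}

/* A browser that restricts :visited styling simply ignores these rules. */
```

The server in this sketch never sees the browser’s history itself. It sees only which of the image URLs were requested, and each one corresponds to a single link the page chose to ask about.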
I’ll point out something that’s not clear from what BoingBoing and the “What the Internet Knows About You” web site say: they can’t download your link history wholesale. They can only probe for specific URLs. That means, for instance, that I could see if you’ve visited specific New York Times articles, but I can’t just get a list of the articles you’ve visited.
Still, this “feature” enables quite a lot of data mining, and essentially can’t be avoided. The folks at What the Internet Knows About You are hoping that by publicizing this, they’ll get the browser makers to shut down the feature, so that web sites can no longer probe the browser’s history in this way. Browsers could do that by restricting how visited links may be styled, or by limiting when a link is considered “visited” (one mechanism they suggest is keeping track of visited links by referrer, so if you visit a New York Times article that I link from here, the same link will not appear as “visited” when you see it on, say, Bruce Schneier’s blog).
However it’s done, it seems that this privacy hole should be closed. We don’t know which web sites might be collecting information in this manner... but we wouldn’t, would we? It’s silent.