New Scientist is usually pretty good, if too cursory, but they really get it wrong this week, when they talk about image spam:
"Computer security experts are struggling to cope with a new type of spam sweeping the internet. The emails can bypass conventional spam filters because they contain images of messages rather than actual words and sentences."
This stuff isn't "new", in any sense. We (that is, those of us who fight spam for a living) have been watching it grow and working on filtering it for at least a couple of years now. Saying it's a new problem is rather like saying that paying for the war in Iraq is a new problem facing our legislators.
The article does quote a claim that these sorts of messages now constitute 40% of spam, up from 18% at the beginning of the year, along with the comment, "That's a big increase." OK, it sure is, assuming that those figures are accurate (I have no data on that). But given that one estimate says that around 30 billion pieces of spam were sent per day in late 2005, even at 18% that was 5.4 billion image-spam messages every day. No, this is not a new type of spam.
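To spell out that arithmetic, here is a quick sketch in Python. The function name is mine, and the 40% line assumes the total volume stayed at the same 30 billion per day, which is only an assumption to give the figure a scale; I have no data on that.

```python
def image_spam_per_day(total_per_day, percent_image):
    """Messages per day that are image spam, given a whole-number percentage."""
    # Integer arithmetic keeps the result exact for counts this large.
    return total_per_day * percent_image // 100


# One estimate of daily spam volume in late 2005 (the figure quoted above).
TOTAL_LATE_2005 = 30_000_000_000

at_18 = image_spam_per_day(TOTAL_LATE_2005, 18)
# Assumption: the same total volume, purely for illustration.
at_40 = image_spam_per_day(TOTAL_LATE_2005, 40)

print(f"{at_18:,} image-spam messages per day at 18%")
print(f"{at_40:,} image-spam messages per day at 40%")
```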
We're also not without mechanisms to block it. The article mentions mechanisms based on routing information, which we've been successfully using on all types of spam. It also talks about optical character recognition as a pie-in-the-sky, futuristic goal. The fact is, though, that
- we have image recognition techniques that do not actually pull the text out of the image, but that do recognize related images and are successful as filtering tools, and
- character recognition is more successful than one might think.
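To give a flavor of the first point, here is a minimal sketch of one well-known way to spot related images without reading any text out of them: a simple "average hash" compared by Hamming distance. This is my illustration, not a description of what any particular filter does. The function names, the 4x4 grids, and the distance threshold are all hypothetical, and the sketch assumes each image has already been shrunk to a small grayscale grid.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if brighter than the image's mean, else 0.

    `pixels` is a 2D list of grayscale values (0-255), assumed to be
    already downscaled to a small fixed-size grid.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming_distance(a, b):
    """Count the positions at which two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("hashes must be the same length")
    return sum(x != y for x, y in zip(a, b))


def looks_related(image_a, image_b, max_distance=2):
    """Treat two images as related if their hashes differ in few bits.

    The threshold is an illustrative guess, not a tuned value.
    """
    distance = hamming_distance(average_hash(image_a), average_hash(image_b))
    return distance <= max_distance


# A known spam image, a slightly perturbed copy, and an unrelated image.
known = [[10, 200, 10, 200], [200, 10, 200, 10],
         [10, 200, 10, 200], [200, 10, 200, 10]]
variant = [[12, 198, 9, 205], [203, 14, 197, 8],
           [11, 201, 13, 199], [196, 9, 202, 12]]
unrelated = [[200, 200, 10, 10], [200, 200, 10, 10],
             [10, 10, 200, 200], [10, 10, 200, 200]]

print(looks_related(known, variant))
print(looks_related(known, unrelated))
```

The point of hashing on coarse brightness patterns is that small pixel-level changes, the sort a spammer might make to dodge exact-match checks, leave the hash mostly unchanged.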
On character recognition: researchers at Microsoft presented a paper at the 2005 Conference on Email and Anti-Spam showing, in a study of text-based CAPTCHAs, that "Computers Beat Humans at Single Character Recognition in Reading-Based Human Interaction Proofs" (and that's two-year-old research now). That's not to say that we've got the problem solved, but simply that the problem space isn't as straightforward as it seems.
I strongly question the "10 to 30 years away" claim for OCR in the New Scientist article; I think it's much closer than that. That says not only that we should soon be as good at detecting this sort of spam as we are at purely text-based spam (which we still can't detect 100% of the time, of course), but also that we should not be relying on character-based CAPTCHAs (which I find obnoxious anyway).