Monday, March 12, 2007


Polling, computerwise

I've had some enforced low-tech time lately, which gave me a chance to read my recently purchased copy of Raymond Chen's book, The Old New Thing. It's a very cool book, and I recommend it to anyone interested in stories about the internals and oddities of Windows, from someone who's worked in Windows development since the Norman Invasion. Raymond also has a very good writing style that makes the book (as well as his blog) fun to read.

Some things in the book have given me ideas for things to write about here, and I'll do that occasionally, as the mood strikes. For this first one, I'm taking off from a chapter about polling, which is based on this blog entry:

Polling kills.

A program should not poll as a matter of course. Doing so can have serious consequences on system performance. It's like checking your watch every minute to see if it's 3 o'clock yet instead of just setting an alarm.

“Polling”, of course, is a technique by which a computer program periodically checks the status of something to see if it has changed since the last time the program checked. Perhaps the most obvious example of this to the average user is when an email program such as Outlook Express checks to see whether you have new mail.

When a program polls, it checks at some pre-defined interval; sometimes that interval is fixed in the program, and sometimes you can change it by setting an option. The “check for new mail” interval is usually an option. But what to set it to? One mathematical aspect of polling is that, assuming that mail is equally likely to arrive at any particular instant, the average time between actual arrival and the next poll is half the polling interval. So if you set the program to check once an hour, on average mail will be in your inbox for 30 minutes before you're aware of it. You can have it check once a minute, so you'll only have to wait 30 seconds on average (and never more than a minute)... but do you really need to know about new email that quickly? And if you normally receive one email message every couple of hours, checking that frequently would mean that your computer, your network, and your email server are spending a lot of time checking for new mail when it's very unlikely that there's any there.
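To make that arithmetic concrete, here's a rough sketch in Python. The function names are mine, invented for illustration, and the simulation assumes exactly what the paragraph above assumes: that a message is equally likely to arrive at any instant within a polling cycle.

```python
import random

def expected_delay(interval_s):
    """Average wait between arrival and the next poll, for uniformly random arrivals."""
    return interval_s / 2

def simulated_delay(interval_s, samples=100_000, seed=0):
    """Estimate the same average by dropping messages at random instants."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        arrival = rng.uniform(0, interval_s)  # when the message lands within one polling cycle
        total += interval_s - arrival         # time left until the next poll
    return total / samples

def polls_per_message(interval_s, seconds_between_messages):
    """How many polls happen, on average, for each message that actually arrives."""
    return seconds_between_messages / interval_s

if __name__ == "__main__":
    print(expected_delay(3600))              # hourly polling, result in seconds
    print(simulated_delay(60))               # once-a-minute polling, estimated
    print(polls_per_message(60, 2 * 3600))   # one message every two hours
```

With once-a-minute polling and one message every couple of hours, polls_per_message(60, 2 * 3600) comes to 120, so nearly all of those checks find nothing new.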

A better solution would be for something to tell your mail program when there's new mail, so it can go retrieve it without having to poll, and you'll find out about it immediately. Most people who use mail programs like Outlook Express use a standard protocol called POP (Post Office Protocol) to get their mail. Programs that use POP have to poll for mail, because there's no way for them to be told about it. IMAP (Internet Message Access Protocol) has an optional extension called IDLE, which Outlook Express supports, that lets a mail server that also supports it tell the email program when new mail is available. That's great, but most people don't use mail servers that support IMAP.

Without that, what are the consequences? Well, as Raymond points out, there are two main issues with polling. The obvious one is the waste of computer time and network bandwidth to do the polling. For some applications that's a serious problem, and it would certainly be a problem for a mail server that has 10,000 active users logged in if all of them polled once a minute! But the second issue is one that would be more apparent to you: your email program would always have to be occupying active memory in your computer. As soon as the operating system was ready to page it out — that is, decide that you haven't used it for a while, so it could be moved aside to make room for the programs you're actively using — it would wake up and look for new mail, and would have to stay in active memory.

That can have a severe effect on your computer's performance. Even worse is the case where it actually does get paged out, and then is quickly paged back in when it wants to poll. Your system can wind up spending more time paging your programs in and out than it does doing real work, a condition called thrashing.

Then, too, while the once-a-minute polling that your email program might do can put a nasty load on the mail server when aggregated over many thousands of users, there are instances of polling that can have a more obvious effect on your own computer. Suppose, for instance, that to accomplish that once-a-minute poll, your program did something like this pseudo-code:

loop: get current time
    compare with time of last email check
    if less than one minute, go to “loop”
    else go to “check for new mail”

That'd be quite horrible: your computer would spend most of its time polling the clock to see if it's time to poll the mail server (note that there's no “wait a bit” in that loop). Quite a waste, and, trust me on this, you would really notice how slowly your computer responded while that was happening. Happily, there are ways, in all modern computer systems and programming languages, to have the clock tell the program when a certain time has passed, so it doesn't have to keep checking.
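As one illustration of the difference, here's a minimal Python sketch that puts the spinning loop from the pseudo-code next to a timer-based version. The names busy_wait and timer_wait are hypothetical, and threading.Timer is just one of several ways Python offers to be called back after a delay; other languages and systems have their own equivalents.

```python
import threading
import time

def busy_wait(delay_s):
    """The loop from the pseudo-code: keep reading the clock until time is up.
    Returns how many times the loop went around."""
    start = time.monotonic()
    iterations = 0
    while time.monotonic() - start < delay_s:
        iterations += 1  # no "wait a bit" here, so this spins at full speed
    return iterations

def timer_wait(delay_s, check_for_new_mail):
    """Ask for check_for_new_mail to be called once delay_s has passed.
    Nothing spins in the meantime; the timer thread sleeps until it is due."""
    timer = threading.Timer(delay_s, check_for_new_mail)
    timer.start()
    return timer

if __name__ == "__main__":
    print("loop iterations while spinning for 0.1 s:", busy_wait(0.1))
    fired = threading.Event()
    timer_wait(0.1, fired.set)
    fired.wait()  # blocks without burning CPU until the timer fires
    print("timer fired")
```

The number the first call prints varies by machine, but it's typically very large for a tenth of a second, and every one of those iterations is CPU time that some other program could have used.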

And that's the correct answer to polling, as Raymond says. Protocols and interfaces should be set up to provide notifications when things that you otherwise might poll for have changed, and programs should use those mechanisms instead of polling. Of course, this is a problem for protocols (such as POP, noted above) that do not have those sorts of mechanisms built into them.
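Here's a small sketch of what notification instead of polling can look like inside a single program. It's only a stand-in: a Python queue plays the part of the mail server, and mail_client and demo are names I made up. The point is that the client blocks until it is told there's something to fetch, rather than asking over and over.

```python
import queue
import threading

def mail_client(inbox, received):
    """Wait to be told about new mail instead of checking for it."""
    while True:
        message = inbox.get()  # blocks, using no CPU, until something is put
        if message is None:    # None is this sketch's "shut down" signal
            break
        received.append(message)

def demo():
    inbox = queue.Queue()
    received = []
    client = threading.Thread(target=mail_client, args=(inbox, received))
    client.start()
    for message in ("hello", "lunch?"):
        inbox.put(message)     # putting a message wakes the waiting client
    inbox.put(None)
    client.join()
    return received

if __name__ == "__main__":
    print(demo())
```

The client does no work at all between messages, however long the gap, and it sees each message as soon as it's delivered rather than half a polling interval later on average.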

This leads into a general talk about paging, locality of reference, and thrashing, and also to one about protocols vs interfaces. But this entry's getting long, so I'll save those for other days.
