Friday, November 03, 2006

Psychic debugging

In this post, Raymond Chen finishes a story about some hard-to-find software problems this way:

Most of what looks like psychic debugging is really just knowing what people tend to get wrong.
I'd never thought of the term "psychic debugging", but yes, it's a combination of a certain skill and a lot of experience with the root causes of most problems.

One example wasn't even a software problem. When I was the support guy for a mainframe computer system, I got a call from a user who was beside herself because "the function keys stopped working on my terminal!" Ah, I said, do you have a 3290 terminal, the plasma screen with the orange characters? You do, OK. Is there a little symbol that looks like a double-arrow pointing up in the status area? OK, hold down the "alt" key and press F19. Did that little symbol go away? It did, good. Do the function keys work now? Wonderful. And from the other end of the phone came, "How did you know?"

But back in the bad old days of programming, by far the most common program errors, apart from just plain program-logic errors, were

  1. uninitialized variables and
  2. exceeding the bounds of an array.
Both caused memory to be "stepped on", both manifested themselves as seemingly random, unpredictable program failures, and both were terribly hard for most people to debug in a program of any complexity. But I'd debugged enough of them that I knew how to find them.
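
As a rough illustration of why an overrun shows up as a seemingly random failure somewhere else, here is a small Python sketch. Python itself checks list bounds, so the sketch models memory as one flat list with two hypothetical arrays, x and y, laid out back to back, and uses unchecked store and load helpers. The layout and all the names are assumptions made for illustration, not taken from any real program.

```python
# Toy model of unchecked array storage: two "arrays" share one flat memory.
# The layout and names (X_BASE, Y_BASE, store, load) are illustrative assumptions.

X_SIZE = 4                  # declared dimension of array x
Y_SIZE = 2                  # declared dimension of array y
X_BASE = 0                  # x occupies memory[0:4]
Y_BASE = X_BASE + X_SIZE    # y sits immediately after x

memory = [0] * (X_SIZE + Y_SIZE)


def store(base, index, value):
    """Store into a 1-based array with no bounds check."""
    memory[base + index - 1] = value


def load(base, index):
    """Load from a 1-based array with no bounds check."""
    return memory[base + index - 1]


# y(1) holds something important.
store(Y_BASE, 1, 99)

# The loop runs one entry past x's declared dimension.
for i in range(1, X_SIZE + 2):
    store(X_BASE, i, -1)

# Nothing failed at the point of the bad store, but y(1) has been overwritten.
corrupted = load(Y_BASE, 1)
print(corrupted)
```

The store that does the damage raises no error; the trouble only appears later, wherever y(1) is next used, which is what makes this kind of bug look random.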

I got a call at home one Sunday morning from a user who desperately needed help. "I know it's not your responsibility, and I know it's Sunday," Dennis began, "but I'm really stuck. I have a bug that I've been working on finding for almost two weeks now, and I have a demo tomorrow morning that just has to work. You're good at this stuff... could you possibly come into the office and help?"

What could I say? I knew I really was his last resort at this point, and he had spent the better part of two weeks trying to find it himself, so I said sure, I'd come in and look at it. And he warned me that one nasty aspect of it was that the program had to run for about an hour and a half before it crashed, which I knew was a pretty good sign that he was running off the end of an array.

The program was written in FORTRAN, which had (at least at the time, the FORTRAN 77 language level) very limited debugging mechanisms for this sort of thing. But the operating system we used, VM/SP, had some things that helped a lot. I got started by compiling the program with some extended listings and setting a breakpoint in VM that would stop when the program crashed. We started it running and I went to my own office to get some work done while we waited. By and by, Dennis came back and said that it had crashed and was stopped at my breakpoint. Yes, after nearly an hour and a half.

I poked around in memory, checked the program listings, and found the right place to set another breakpoint. "We'll need to run it one more time," I said, "and then I'll have the answer." Back to my office for another hour and a half. This time when Dennis came back, the program was stopped at the precise instruction that was storing data past the end of the array. I checked the address, looked in the program listing, and pointed at the line of FORTRAN code. I checked the register that held the array index. "OK, Dennis, it's storing something into entry 10,001 of array X." "Damn!" he replied. "I should have known! That array used to be 10,000 entries, and I changed it to 20,000... but I guess I changed the loops and forgot to change the dimension declaration."

Dennis went off to fix the program, and I went home to finish my Sunday, telling him to feel free to call me again if there was still a problem. He didn't call.

On Monday morning I was sitting in my office, and my manager, Matt, stopped by with his coffee. We sat and talked for a bit. Suddenly, Dennis came into my office, dropped to his hands and knees, and pretended to kiss my feet, saying, "Thank you, thank you, thank you, thank you! You saved my life!" He crawled out, still on his knees.

Matt sat there for a moment, silent, then said, "Do they do that often?"

2 comments:

Anonymous said...

Pascal in a giant "weeder" intro programming course in the early eighties. Student asked the assembled multitude for help, so we spread the listing on the table. A subroutine and a main routine, each about two pages on large-format fanfold.

Several people standing around reading it, the author responding to questions, the usual drill.

Cut to the chase: he had a manifest constant named "one" with the value 1 declared in one module. In the other he had an integer variable named "one" -- uninitialized, of course.

My distrust of pointless manifest constants was the key to "psychic" debugging this one. Other readers accepted the named constant at "face value" - literally!

Barry Leiba said...

Reminds me of how FORTRAN (at least up thru the '77 level) used to allow you to change the values of constants by passing them to subroutines — not just changing the value of something like "one", but actually changing the value of "1".

It passed all parameters by reference, so if you called a subroutine with a parameter of "1", the address of the constant 1 was passed. If the subroutine then, say, incremented that parameter, the actual value of "1" would become 2 thereafter.

That caused some weird bugs!
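
The effect can be imitated in Python with a toy model. This is only a sketch: it assumes each literal lives in a single shared storage cell and that a subroutine receives that cell rather than a copy. Cell, constant, and increment are hypothetical names made up for the illustration, and nothing here is how any particular FORTRAN compiler was actually implemented.

```python
# Toy model of pass-by-reference with one shared storage cell per literal.
# Cell, constant, and increment are hypothetical names for illustration.


class Cell:
    """A storage location holding one value."""

    def __init__(self, value):
        self.value = value


_pool = {}


def constant(literal):
    """Return the single shared storage cell for a literal."""
    if literal not in _pool:
        _pool[literal] = Cell(literal)
    return _pool[literal]


def increment(param):
    """A 'subroutine' that modifies its by-reference parameter."""
    param.value += 1


# Passing the literal 1 hands over the address of its cell.
increment(constant(1))

# Every later use of "1" reads the same, now modified, cell.
result = constant(1).value + constant(1).value
print(result)
```

After the call, every later use of the literal picks up the changed cell, so in this model "1 + 1" comes out as 4.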