Debugging is the process of finding the reason a system misbehaves, making the appropriate correction, and verifying that this fixes the problem. In this article, I discuss how to find the problem. There may be later articles for other aspects of the bug fixing process.
There is not a lot of literature about debugging. One not-unreasonable book is "Debugging rules", by David Agans. There is also little in the way of well-known systematic approaches to finding causes of problems. This article is based on my 29 years of programming experience, with nearly daily debugging, on various articles and books I've read over the years, and on swapping war stories with other programmers. The overall structure follows that of the "Debugging rules" book, but using my words.
This article further concentrates on problems in a running program, when it is used for real. Problems that happen during building or during test suites tend to be easier to find. What's more, test-driven development helps at that phase of a software's life cycle: there is usually only a small delta since the previous successful test run, so it's usually pretty obvious where the problem is.
Find a reliable way to reproduce the problem
Whatever the problem is, debugging is much easier if you can reliably, and preferably easily and quickly, reproduce the problem. If you can run the program in a given way, and it exhibits the problem immediately, you spend less time waiting for the problem to happen again.
At the other extreme are problems that you don't know how to reproduce at all. This kind of problem can be almost impossible to fix.
There's no real recipe for reproducing bugs. It depends on the bug, and the circumstances in which it happens, and the program itself. Use your ingenuity here.
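Once you do have a candidate recipe, it can help to measure how reliable it is before you depend on it. The following is a minimal sketch in Python, not a general tool: `attempt` is a hypothetical callable that you would write yourself, which performs one run of your recipe and returns True if the problem showed up.

```python
from typing import Callable


def failure_rate(attempt: Callable[[], bool], runs: int = 20) -> float:
    """Run a candidate reproduction recipe several times.

    `attempt` is a hypothetical callable: it performs one run of the
    recipe and returns True if the problem showed up on that run.
    Returns the fraction of runs in which the problem appeared.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for _ in range(runs) if attempt())
    return failures / runs
```

A rate close to 1.0 suggests the recipe is good enough to debug with. A low rate suggests that some condition you haven't identified yet is involved.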
The kind of user who can make a program crash reliably, in the same way, every time, is extremely valuable. Treasure them.
Binary searching for bugs
When you have your reliable reproduction recipe, it's time to find where in the program the problem is. The basic technique for this is divide and conquer, which programmers know as binary search. You put a marker in the middle of a program's execution, and see whether the problem appears before or after it. Then you add more markers, dividing the execution into smaller and smaller parts, diving deeper and deeper into the program's logic. Obviously, it's possible to divide into more than two parts at once, to find the location more quickly.
Eventually you find the problem.
Unless you're unlucky.
Sometimes this division doesn't work. For example, the problem may only show itself in the second half of the program, but actually be caused by something that happens in the first half. This can be hard to find, since everything in the first half looks to be in order.
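The divide-and-conquer idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a tool: the program is modelled as a sequence of numbered steps, and `is_bad_after` is a hypothetical check that you supply, which reports whether the problem is visible once a given step has run. The sketch also assumes that once the state has gone bad, it stays bad.

```python
from typing import Callable


def first_bad_step(num_steps: int, is_bad_after: Callable[[int], bool]) -> int:
    """Return the index of the first step after which the state is bad.

    Assumptions: steps 0..num_steps-1 run in order, the state is good
    before step 0, it is bad after the last step, and once it has
    gone bad it stays bad.
    """
    lo, hi = 0, num_steps - 1  # the step we want is somewhere in lo..hi
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad_after(mid):
            hi = mid      # already visible here: look at mid or earlier
        else:
            lo = mid + 1  # still fine here: look later
    return lo
```

Note that this finds the step where the problem first becomes visible. As described above, that is not always the step that causes it.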
Don't guess, don't assume, watch what's really happening
A common problem in debugging is for the programmer to act based on their mental model of what's happening. They think they know what's going on, and then they do things based on that. This is dangerous: the map is not the terrain, and the main reason for bugs is the difference between the programmer's mental model and reality.
When you're in debugging mode, you should ignore your mental model, and look at what the code actually does. This is difficult to do, but it is usually required to find the bug.
Debuggers versus logging/print statements
There are two kinds of programmers: those who use debuggers, those who use logging or print statements, those who can count, and those who use whatever tool is best for the situation.
A debugger is a tool that looks at a running program and lets you control the execution, and examine (and sometimes alter) the internal state of the program. In other words, it lets you run a program, stop it at any point, and look at the values of variables at that point. See our recent article on gdb for an example. The more advanced a debugger is, the more ways it provides for this basic task. For example, the stopping points (breakpoints) may be conditional: the program will only stop if an expression using values from the program being debugged is true.
The other common approach is to put statements in the program being executed to print out the values, either to the screen, or a log file of some sort.
Both approaches are valuable, and you should learn to use both. A debugger can be very efficient for zeroing in on a problem, when you know how to reproduce it and need to find out what's happening. Log files are especially useful for long-running programs, and for analysing problems from production runs. Debuggers are the tool of choice when modifying the program is difficult or impossible. Log files are the right thing when running a program under a debugger isn't possible, or if the debugger would make the program run too slowly to be usable. Sometimes a combination works best.
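As a small illustration of a conditional stop, here is a sketch in Python, using the built-in `breakpoint()` function, which enters the pdb debugger by default. The `process` function and the shape of its data are made up for this example.

```python
def process(orders):
    """Hypothetical function, used only to illustrate a conditional stop."""
    total = 0
    for order in orders:
        # Stop only when the suspicious condition holds, instead of
        # stopping on every iteration of the loop.
        if order["quantity"] < 0:
            breakpoint()
        total += order["quantity"] * order["price"]
    return total
```

This version requires editing the program. Debuggers usually let you attach the condition to the breakpoint itself, so the program stays untouched: in pdb, for example, the `break` command accepts a condition after the location.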
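A minimal sketch of this approach, using Python's standard `logging` module. The function `apply_discount` and the logger name are invented for the example. Messages go to the screen (standard error) here; passing `filename=` to `basicConfig` would send them to a log file instead.

```python
import logging

# Send debug messages to standard error. Use filename="..." here
# instead to write them to a log file.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("example")


def apply_discount(price, percent):
    """Hypothetical function, instrumented to show its inputs and result."""
    log.debug("apply_discount: price=%r percent=%r", price, percent)
    result = price * (100 - percent) / 100
    log.debug("apply_discount: result=%r", result)
    return result
```

Logging the inputs and the result separately makes it possible to tell, from the log alone, whether a bad value came into the function or was produced by it.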
Don't change the software more than you have to
When debugging a program, you should keep the changes you make down to as few as possible. Any change you make may affect the program in surprising ways. You should avoid the temptation of fixing even simple things, such as typos in comments, just in case. If nothing else, getting deep into a stack of changes will distract you from the task at hand (queue your yaks, don't stack them). Instead, make notes of anything you want to fix so you can get back to it later.
If you make a change, and it doesn't help you get closer to the problem, undo it.
Keep an audit trail
The human short-term memory is about seven items large. That's not a lot. A debugging session may easily overflow that. You should keep an audit trail so you remember everything you've done, anything you've tried, and what the result was. Did you do a test run with the volume set to 11 or did you just think it would be a good thing to do?
An electronic journal is very useful for this. Copy-pasting code snippets, log files, screenshots, etc, is helpful. This becomes especially important if a debugging session takes days or weeks, since in that case it is guaranteed you will forget most of what you've done. (See the Obnam journal snippet for an example.)
Further, any changes to the code you make, you should commit into a version control system, using as tiny commits as you can. Make a new branch, where you can safely make any changes you want. Make several, if need be.
If you can't reproduce a problem, this can itself be valuable information. Find the differences between your system and the system where the problem shows up, and eliminate them one by one.
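Finding the differences can be partly mechanical. As a sketch, assume you have captured some settings from each system as a Python dictionary with string keys (for example environment variables, or package names mapped to versions). The helper below is hypothetical and only lists what differs. Deciding which difference matters is still up to you.

```python
def diff_settings(mine: dict, theirs: dict) -> dict:
    """Return {key: (my_value, their_value)} for every key that differs.

    A key that is missing on one side is reported as None on that side.
    """
    differences = {}
    for key in sorted(set(mine) | set(theirs)):
        a, b = mine.get(key), theirs.get(key)
        if a != b:
            differences[key] = (a, b)
    return differences
```

For example, comparing `{"LANG": "C", "TZ": "UTC"}` with `{"LANG": "en_US.UTF-8", "TZ": "UTC", "DEBUG": "1"}` reports `LANG` and `DEBUG`, which gives you two candidates to eliminate one by one.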