My approach to bugs

As the recent spate of bug fix and patch releases shows, I’m not scared of talking about the bugs that I find in the code of The Server Framework and pushing fixes out quickly. It’s my belief that the most important thing to get out of a bug report is an improved process that helps prevent similar bugs from occurring in future, and the only way to achieve that is to be open about the bugs you find and equally open about how you then address them. Every bug is an opportunity to improve. Sometimes I wish I had fewer opportunities…

I have a ‘no known bugs’ approach to software development. As soon as there’s an open bug report, that bug is the single most important thing to work on. It doesn’t matter if there’s a deadline to hit for unrelated code; the bug is top priority. The reason for this approach is that over the years I’ve found that the worst way to try to hit a deadline for code delivery is to ignore your bugs; they always come back and bite you at the wrong time. What’s more, having a ‘no known bugs’ approach from day 0 means that you don’t accumulate open bug reports. Sure, your perceived rate of progress towards the next delivery may be slower as you fix all of the open issues, but your actual rate of progress towards releasing the product is faster and is a more reliable indicator of when you’ll really be done. I know that retrofitting a ‘no known bugs’ policy onto an existing code base is hard work, but it pays off eventually.

A good bug report, ideally with a fix included, is the best thing that I can get from a client. Even if there isn’t a fix, a clear and concise description of the problem and a way to reproduce it is priceless. I’m sure that for every bug that is known about and reported there are others that have yet to be found, so the first thing that I do when I get a bug report is to adjust my testing to show the problem; where possible I adjust HOW I test, rather than just the unit tests in question. Could the style of testing that I was practising when I developed the code in question have led to similar bugs, with a similar lack of test coverage, elsewhere in the code base?

Taking the most recent bug report as an example, I immediately decided that I needed to know why I wasn’t covering this particular branch of the switch when I knew that I had pretty good coverage of the class concerned. I also started to work out how to adjust my recently updated release test scripts so that all of the service examples are run as actual services on at least one of my build machines.

DevPartner Studio’s code coverage tool showed me that I had 73.7% coverage of the class with the bug and that only RunActionRunAsExe and RunActionRunAsService were not covered. I tend not to run a coverage tool as part of my unit testing as I’m wary of using coverage as a measure of testing goodness; it’s not just the coverage of lines executed that matters, it’s the coverage of the various combinations of code execution that I really need to see, and I haven’t yet found a tool that can show me that. 100% line coverage doesn’t mean 100% of all possible combinations and so, on its own, is pretty meaningless. Also, a tool that I have to run manually in the IDE is less use to me than a tool that can be integrated into the build and test cycle that my build machines run; as with the recent addition of the lock inversion detector, if I have to run it manually it doesn’t get run! That said, missing two large pieces of functionality completely is also rather poor, so these tools do have a place in the process. Annoyingly, it seems that I was using coverage tools better back in 2004 than I am now.
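
To make the shape of the problem concrete, the dispatch in question amounts to a switch over a set of run actions. The sketch below is illustrative rather than the framework’s actual code; only the RunActionRunAsExe and RunActionRunAsService names come from the real class, everything else is an assumption.

```cpp
#include <iostream>

// Illustrative sketch only: just the RunActionRunAsExe and
// RunActionRunAsService names come from the real class; the enum,
// the other actions and the dispatch function are assumptions.

enum RunAction
{
   Install,
   Remove,
   RunAsExe,         // selected by the /run switch
   RunAsService      // selected when the SCM starts the service
};

static int RunActionInstall() { return 0; }

static int RunActionRemove() { return 0; }

static int RunActionRunAsExe()
{
   std::cout << "Running as a normal executable\n";
   return 0;
}

static int RunActionRunAsService()
{
   std::cout << "Running as a service\n";
   return 0;
}

int Run(const RunAction action)
{
   switch (action)
   {
      case Install : return RunActionInstall();
      case Remove : return RunActionRemove();

      // These two branches are the ones that the coverage run showed
      // the unit tests never executing.
      case RunAsExe : return RunActionRunAsExe();
      case RunAsService : return RunActionRunAsService();
   }

   return -1;
}
```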

Adding unit tests to cover the missing functionality was pretty easy; it brought the code coverage up to 86% (49 of 57 lines) and exposed the bug, which I then fixed so that the tests could pass. This brought the number of tests for the service tools library up from 89 to 98, as I also added several similar tests to the multiple instance service manager class. It’s always tempting at this point to try to hit that 100% figure, even when I know it’s meaningless. The remaining uncovered code is all in error paths and is all fairly simple, so right now I’m not going to add additional tests for it; that would, IMHO, be a pretty academic exercise. Now, if only I had an intern with nothing else to do…

The fact that adding the new tests was easy was reassuring. I’d already covered most of the similar functionality, so all of the decoupling and mocking was already done to support the other tests. In fact, the mock service control manager interface class (the class which decouples my service code from the operating system’s APIs and so allows me to test ‘running as a service’ without actually needing to call the underlying API) had a bug in the call that is only needed by the new tests that I added. It seems that I simply stopped testing this piece of code too soon and never wrote the final couple of tests. In this situation running the coverage tool would have helped me, as it would have made it pretty clear that I wasn’t done.
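
To give a flavour of the decoupling involved, here’s a minimal sketch of the kind of interface and mock that make this testing possible. It is not the framework’s actual interface and all of the names here are hypothetical, though the Win32 APIs named in the comments are the real ones being wrapped.

```cpp
#include <windows.h>

#include <string>
#include <vector>

// A minimal sketch of the decoupling idea; the framework's actual
// interface is richer and all of the names here are hypothetical.

class IServiceControlManager
{
   public :

      // Wraps ::StartServiceCtrlDispatcherW()
      virtual bool StartDispatcher(
         const SERVICE_TABLE_ENTRYW *pTable) = 0;

      // Wraps ::RegisterServiceCtrlHandlerW()
      virtual SERVICE_STATUS_HANDLE RegisterHandler(
         const std::wstring &serviceName,
         LPHANDLER_FUNCTION pHandler) = 0;

      // Wraps ::SetServiceStatus()
      virtual bool SetStatus(
         SERVICE_STATUS_HANDLE handle,
         const SERVICE_STATUS &status) = 0;

      virtual ~IServiceControlManager() {}
};

// The mock records what the code under test does and returns canned
// results, so the 'running as a service' path can be exercised without
// the real SCM ever being involved.

class MockServiceControlManager : public IServiceControlManager
{
   public :

      bool StartDispatcher(const SERVICE_TABLE_ENTRYW * /*pTable*/)
      {
         calls.push_back("StartDispatcher");
         return true;
      }

      SERVICE_STATUS_HANDLE RegisterHandler(
         const std::wstring & /*serviceName*/,
         LPHANDLER_FUNCTION /*pHandler*/)
      {
         calls.push_back("RegisterHandler");
         return reinterpret_cast<SERVICE_STATUS_HANDLE>(this);
      }

      bool SetStatus(
         SERVICE_STATUS_HANDLE /*handle*/,
         const SERVICE_STATUS & /*status*/)
      {
         calls.push_back("SetStatus");
         return true;
      }

      std::vector<std::string> calls;    // tests assert on this log
};
```

A test for the service code path injects the mock, runs the code and then asserts on the recorded calls; it was one of these rarely needed mock calls that contained the bug that only the new tests exercised.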

The fact that the pre-release testing of the service-based example servers also didn’t show up this recent bug means that my pre-release testing needs to be adjusted. Right now the service examples are run using the /run switch, which runs them as normal executables. This avoids the various complexities of having the test install the service and then start it using the SCM, but it also avoids testing the service code itself, which is unfortunate and needs to change. The reason that I took this route is that I tend to run all my build machines with the least privileges possible, and installing a service as part of the test would require running as administrator. For now I may just adjust the tests so that they can run on the XP test machine, as that runs with admin rights, but a longer term solution needs to be integrated into the build.
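
For reference, the longer term fix means the test harness doing something along these lines. This is a sketch only; the service name and binary path are placeholders, and asking the SCM for SC_MANAGER_CREATE_SERVICE rights is precisely the step that requires administrative privileges.

```cpp
#include <windows.h>

#pragma comment(lib, "Advapi32.lib")

// A sketch of an automated install/start/stop/remove test step; the
// service name and binary path below are placeholders.

bool RunServiceInstallTest()
{
   SC_HANDLE scm = ::OpenSCManagerW(0, 0, SC_MANAGER_CREATE_SERVICE);

   if (!scm)
   {
      return false;
   }

   SC_HANDLE service = ::CreateServiceW(
      scm,
      L"EchoServerService",                  // placeholder name
      L"Echo Server Service",
      SERVICE_ALL_ACCESS,
      SERVICE_WIN32_OWN_PROCESS,
      SERVICE_DEMAND_START,
      SERVICE_ERROR_NORMAL,
      L"C:\\tests\\EchoServerService.exe",   // placeholder path
      0, 0, 0, 0, 0);

   bool ok = false;

   if (service)
   {
      ok = (::StartServiceW(service, 0, 0) != FALSE);

      if (ok)
      {
         // A real version would poll ::QueryServiceStatusEx() until the
         // service reports SERVICE_RUNNING and then run the black box
         // tests against it before stopping it.

         SERVICE_STATUS status;

         ::ControlService(service, SERVICE_CONTROL_STOP, &status);
      }

      ::DeleteService(service);

      ::CloseServiceHandle(service);
   }

   ::CloseServiceHandle(scm);

   return ok;
}
```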

So, what have I got from this bug? First, I need to integrate the manual process of running the coverage tool into my build process. Ideally this would be an output of the build machines: an occasional alert email when the coverage numbers drop by a threshold could help me move in the right direction without drowning me in information that I already know (“your coverage isn’t enough”) or lulling me into a false sense of “I have 100% coverage” security. Secondly, know what you’re actually testing; the black box tests for the service examples aren’t testing the service-specific functionality and so are less use than they could be. Finally, bug reports are a positive thing; use them to drive process improvement and self-analysis, and don’t be scared or shy of talking about them and thinking about their consequences.
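
As a first step towards those alerts, something as small as the following post-build check might be enough. This is a sketch under assumptions: the file names and the threshold are invented, and in reality the current figure would be parsed from the coverage tool’s own report format.

```cpp
#include <fstream>
#include <iostream>

// A sketch of a post-build coverage check; the file names and the 5%
// threshold are assumptions.

int main()
{
   double previous = 0.0;
   double current = 0.0;

   std::ifstream previousFile("coverage.previous.txt");
   std::ifstream currentFile("coverage.current.txt");

   if (!(previousFile >> previous) || !(currentFile >> current))
   {
      std::cerr << "Could not read coverage figures" << std::endl;
      return 2;
   }

   const double threshold = 5.0;

   if (current < previous - threshold)
   {
      // A failing exit code is enough; the build machines already know
      // how to turn a failed step into an email.
      std::cerr << "Coverage dropped from " << previous << "% to "
                << current << "%" << std::endl;
      return 1;
   }

   return 0;
}
```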