How We Deal With Software Defects in Production
Yes, we have software defects in Production. Given that mobile.de runs on a platform with several million lines of code and that we offer a huge spectrum of applications and services for our customers, this probably comes as no surprise and is not particularly noteworthy.
Yet, there is an interesting story behind this whole matter, and the story centers around our belief that we are pretty good in dealing with this problem that every software-producing enterprise is facing. The most interesting part of the story is, however, that we were not always that good, and how we got to where we are now.
If you haven’t seen it already, now would be a good time to go to our website and explore it for, say, 10 minutes. Perform a few searches, put some car ads you found on your car park, and click through the informational content. You might even advertise a car, although I’d ask you to delete it afterwards - obviously we don’t want bogus car ads on our platform, because they spoil our customers’ experience.
As a software developer, you can probably estimate the complexity behind this platform. And what you see is only the tip of the iceberg. We have a number of front-end applications that are only available to our dealer customers, we have several APIs that are used by a variety of different customers and business partners, we have lots of internal applications and a zoo of backend jobs and processes.
As of the day I am writing this, we have 10 known software defects in Production across all these applications. I think this is a remarkably low number. What is also remarkable is that, with very high probability, about half of these defects will be fixed in Production within half a week and almost all of them within a week. Defects with a high impact on our customers are usually fixed on the day they emerge. And by “fixed” I do not mean “I, the developer, have a fix for this problem” but “We, the team, have a fix for this problem, have tested it in a preproduction system, and the fix has been rolled out in Production”.
It is not easy to compare these figures with those of other comparable websites, as most do not publish information on how many software defects they have in Production. So use your own judgement: contrast the complexity you see on mobile.de with the number of known defects and the time it takes us to fix them. If you think we suck at it, or even if you think we’re mediocre, please add a comment to this post and explain why you think so.
History in charts
OK, enough self-praise. To be honest, a few years ago we were pretty bad at handling Production defects. In the summer of 2010, we had around 450 known defects in Production, and almost 300 of these were beyond their SLA due date, some for more than a year. It took us several years, many organizational changes and a lot of work to change it.
The effects of this change are best documented by three charts to which I will be referring a lot in this article. I will call them the “new tickets chart”, the “backlog chart” and the “lead time chart”.
The three charts all span the same time frame, from June 2009 until September 2013, a little more than four years. They go back to June 2009 because that is when we started tracking our production defects in a separate JIRA project (called “PROD”) - before that, we did not really distinguish between production defect tickets and other kinds of tickets; we only had one project for defects and change requests (the “MBL” project). These charts show aggregations of ticket data from our PROD JIRA project. I will first describe these charts before I interpret them.
These charts are slightly simplified: until December 2012 we distinguished more priorities - additionally we had MEDIUM (which is grouped with MAJOR in these charts) and TRIVIAL (which is grouped with MINOR in these charts). As the priorities define the SLA due dates of the tickets, this simplification skews the data a little, but it does not change their overall essence. I just wanted to make the charts a little more digestible. If you want to see the charts for the full-blown priority scheme, contact me and I will provide them.
The “new tickets chart” below basically shows how many new “real” PROD tickets were created in a month. “Real” means that tickets that were not reproducible or turned out not to be defects are filtered out. Every stacked line represents the count for that month, and the different colors show the share of different ticket priorities.
This chart is not too spectacular. It shows that the average monthly new ticket count was a little below 90 for 2009, between 50 and 60 for 2010, between 40 and 50 for 2011 and 2012, and again between 50 and 60 in what we have seen of 2013 so far. To sum it up, we were pretty bad in 2009 and improved a lot over the following years. The ticket production rate seems to have accelerated a little in 2013 (to be frank, we don’t have an explanation for this). It should also be noted that the size of the Technology department has gradually grown by about 30% since 2009, so the average number of defects per developer per month has dropped continually.
The “backlog chart” below shows the number of unresolved tickets in the PROD project (the green line) and how many of these were beyond their SLA due date (the red line). Both lines start with initial values higher than zero because we had been fading in the PROD project in the previous months.
The change in the backlog chart is more spectacular. The initial rise is due to the previously mentioned fading-in of the PROD project, but it is already an indicator that it took us far too long to fix defects. It basically shows that we accumulated new tickets in this project without fixing them, because we were still busy fixing the remaining defects in the old MBL project. At the end of 2009, all the defect tickets in the old MBL project were closed, and from then on the backlog in the PROD project meanders up and down for about a year. Then, in the winter season 2010/11, the backlog dips steeply to a dramatically lower value, and then continually shrinks to the current level of between 10 and 20. Furthermore, we are able to process practically all tickets within their SLA periods. The SLA compliance deteriorates a little at the beginning of 2013, but is under control again by mid-2013.
The “lead time chart” below shows the average lead time for all “real” tickets that were closed within a month, by priority (the red, orange and green lines for BLOCKER, MAJOR and MINOR tickets). The lead time for a ticket is the number of days between the day it was reported and the day it was closed (= fixed in Production). The scale for these averages is the y-axis on the left. The black line is the cumulative “SLA debt” for these tickets. By “SLA debt” I mean the number of days beyond the SLA due date it took to fix the defect. So if a ticket was closed within the SLA period, the SLA debt for this ticket was 0. If it was fixed 2 days after the SLA due date, the SLA debt for this ticket was 2. The black line shows the sum of the SLA debts for all the tickets that were closed within that month. It can be seen as a measure of how well we comply with our self-defined SLAs. The scale for this line is the y-axis on the right.
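The SLA debt metric is simple enough to sketch in a few lines of Python. This is an illustration of the definition above, not our actual reporting code; the function names are my own:

```python
from datetime import date

def sla_debt(closed: date, due: date) -> int:
    """Days beyond the SLA due date; 0 if the ticket was closed on time."""
    return max(0, (closed - due).days)

def monthly_sla_debt(tickets: list[tuple[date, date]]) -> int:
    """Sum of the SLA debts of all tickets closed in a given month.

    Each ticket is a (closed, due) pair of dates.
    """
    return sum(sla_debt(closed, due) for closed, due in tickets)
```

A ticket closed two days after its due date contributes 2 to the monthly sum; a ticket closed on time or early contributes 0.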
The change in the lead time chart is as spectacular as the one in the backlog chart, and the point in time when the change occurs is, as in the backlog chart, the winter season 2010/11. Before that point, both the curves of the average lead times for the priorities and the curve for the SLA debt jump up and down rather erratically. Afterwards, the curves for the priorities slope down gently to the current values. The SLA debt curve rises a little again at the beginning of 2013, and is under control again by mid-2013. This corresponds to the SLA compliance curve in the backlog chart. Note that the SLA debt/compliance curves show this rise for that half year despite the fact that lead times remain excellent - this is another interesting change that will be explained.
So there are two interesting pivotal points in time that require an explanation: the winter season 2010/11, when the backlog shrank and remained under control thereafter, and when we started being able to comply with our SLAs; and the beginning of 2013, when something temporarily affected SLA compliance. Let’s look at what happened at those two dates.
How we changed for the better
What happened in September 2010 is that we discontinued having a Maintenance team in charge of Production defect fixing. It was actually not a choice we made - legal reasons forced us to cancel our contracts with an outsourced development partner who, at that time, was responsible for the Production defect fixing. Instead of putting one internal team on defect fixing shift, we decided to distribute the responsibilities. We did this to address several problems with the Maintenance team setup:
We assumed that it might be beneficial to introduce a feedback loop to the developers. The Maintenance team (whether outsourced as at that time or internal as in the years before) had silently been fixing defects that were created by the development teams, so effectively the people who created the defects were often not even aware of them, increasing the probability that they would make the same mistakes again.
If you have to fix a defect for code that you did not create, the time to analyse and fix the defect is much higher.
Being on Maintenance shift was an extremely unpopular and demotivating assignment. In the face of this mountain of defects in the backlog, it felt like chipping away rock from Mount Everest with ice picks.
It was also obvious that we assigned too few resources to defect fixing. This is simple math - if you produce more tickets than you can fix in time, the defect fixing is understaffed. We would have needed a much larger Maintenance team for that, which would have reduced the number of other development teams, and thus the number of feature projects you can run in parallel.
A fixed team size for a Maintenance team also means that you have scaling problems - if you have a small backlog, the team would be idle at some point (admittedly, that would have been a problem we would have liked to have), and when the backlog rises above a certain size it means that you cannot fix all tickets within their SLA.
In addition to distributing defect fixing, we set up a proper process for handling the backlog, which has been refined since (I will describe this process later). Before that, the choice of tickets that actually went into fixing seems to have been rather random. You can see this in the erratic nature of the lead time chart before 2011. The big spikes in the MINOR ticket and SLA debt curves come from the fact that somebody decided to pay some more attention to very old MINOR tickets for a month or two and then focus again on the MAJOR tickets that had piled up in the meantime.
The effect of the organizational change can be clearly seen in the backlog and lead time charts. I do not mean the sharp drop at the beginning of 2011 - this is admittedly caused by two one-time actions: descoping a large number of non-customer-facing MINOR tickets, and temporarily higher efforts from the teams to reduce their part of the backlog. What I mean is the fact that we were able to process effectively all tickets within their SLA period, and that the backlog was under control. There are a few minor spikes and dips, but they are minuscule compared to the ups and downs in 2010. It shows a very smooth flow despite the fact that the new tickets chart shows fluctuating production rates.
The slight bump in the SLA compliance curve in the backlog chart and the SLA debt curve in the lead time chart at the beginning of 2013 is not the effect of an organizational problem but rather the effect of our wish to improve. We wanted to take the promise to our customers to fix defects speedily to the next level. Effective January 2013, we reduced the SLA periods drastically. Before 2013, we had up to 90 work days for a TRIVIAL ticket, and still 15 work days for a MAJOR ticket. Now that the process was firmly established in the organization and the backlog was under control, we discontinued the intricate priority scheme and now only use BLOCKER (“drop everything you’re doing and fix it right now”), MAJOR (“start working on it today or tomorrow”) and MINOR (“can wait a few days”). The SLA period for a BLOCKER is 2 work days, for a MAJOR 5 work days and for a MINOR 10 work days.
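Because the SLA periods are expressed in work days, the due date of a ticket depends on where weekends fall. For illustration, here is a sketch of how such a due date could be derived from the report date and priority; this is my own hypothetical code, not our actual JIRA configuration:

```python
from datetime import date, timedelta

# The post-2013 priority scheme: SLA period in work days per priority.
SLA_WORK_DAYS = {"BLOCKER": 2, "MAJOR": 5, "MINOR": 10}

def sla_due_date(reported: date, priority: str) -> date:
    """Advance from the report date by the priority's number of work days.

    Weekends are skipped; public holidays are ignored in this sketch.
    """
    remaining = SLA_WORK_DAYS[priority]
    day = reported
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday (0) through Friday (4) count
            remaining -= 1
    return day
```

A BLOCKER reported on a Monday is due on Wednesday; a MAJOR reported the same day is due the following Monday.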
It took the teams several months to adapt to this new SLA scheme. Admittedly, complying with the previous SLA scheme had been pretty much a no-brainer, no-effort task once we had an effective handling method, and it was easy to fit the tickets into the teams’ ongoing development backlogs. It was no problem if a MINOR ticket waited for a month until you finished the project you were currently working on. Now you usually had to react rather quickly, you had to make sure that it did not affect your current project work too much, and sometimes you had to tell the product owner that you could not work on his feature because you had to fix a defect. The teams reacted pretty much by ignoring the new SLA scheme, and our SLA compliance deteriorated until, in May 2013, the majority of tickets in the backlog were past their SLA due date.
What fixed this for us was raising awareness for this problem with a very simple tool. We have a basic division of teams in the tech department: teams for consumer customer products, and teams for commercial customer products. In the rooms where these teams reside, we put up big screen monitors that showed only two numbers. The consumer teams see the number of tickets that belong to all consumer teams, and the number of these tickets that are beyond their SLA due date; the commercial teams see the same figures for their backlog.
The screen background is green as long as there are no tickets beyond their SLA due date and there are not more tickets than teams - one ticket per team is OK. If the number of tickets is higher than the number of teams, it becomes yellow, and as soon as there is one ticket beyond its SLA due date, the screen background becomes red. It is very visible in the rooms, so everyone is aware of the backlog size and the SLA compliance all the time. Putting up this screen started a little competition between the two groups over who would have the lowest values, and now we’re at the lowest backlog size in our history, and the best SLA compliance, despite the fact that we have much tighter SLAs.
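The traffic-light logic on those screens boils down to two thresholds. A minimal sketch (the function name is mine, and the real screens may of course be implemented quite differently):

```python
def screen_color(open_tickets: int, overdue_tickets: int, team_count: int) -> str:
    """Traffic-light status for a group's defect backlog screen."""
    if overdue_tickets > 0:
        return "red"     # at least one ticket is past its SLA due date
    if open_tickets > team_count:
        return "yellow"  # more open tickets than teams in the group
    return "green"       # at most one ticket per team, nothing overdue
```

Note that overdue tickets dominate: even a tiny backlog turns the screen red if a single ticket has passed its due date.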
How we do it in everyday work
Changing the Maintenance team setup was a necessary precondition for getting a grip on the problem, but of course it is not enough. You need an effective way to distribute the tickets over the teams and monitor the state of the tickets in the queue. We’re not fans of big centralized processes at the mobile.de tech department, so the process is supposed to be as lean and distributed as possible. Here’s how it goes.
Every night an automated email is sent to all technologists and product owners reporting all the new tickets that were created the day before. Everybody has the new tickets in their mailboxes when they come to work, and often it is the first mail people read on that day.
Every day at 10:45 we have a short standup meeting which is attended by one envoy from every team. The collective looks at the new tickets and distributes them to the teams (if they haven’t been pulled by one of the teams yet). We also talk about the tickets that will pass their SLA due date soon and ask for an ETA. If a team sees that it will have problems fixing a ticket in time because of some kind of resource problem, it can bring the ticket to this meeting and ask another team to take over. We also look at tickets that might have passed their SLA due dates and ask what the problem is.
The rest of the process is basically up to the teams. The only requirement is that the team has to make every effort to make sure the ticket is solved in the SLA period.
The team closes the ticket when the fix has been rolled out in production. It should be pointed out that you need to be able to deploy changes very quickly to production if you want to excel here. If you have to adhere to some weekly or even less frequent release schedule, you can fix the defects as fast as you want, but most of your lead time will be beyond your control. We have a continuous deployment mechanism that allows us to deploy right after the fix was verified, so lead time is totally within the team’s control.
While we have made great progress over the last few years, there is still room for improvement. If you look at the charts you see dramatic changes in the backlog and lead time charts, but while there is also some improvement in the new tickets chart, we are still producing many tickets every month. The fact that we process the tickets very speedily should not make you believe that there is no effort attached. In fact, my educated guess is that the teams spend about 20% of their time on defect fixing. That’s a lot of time they could be spending on feature projects or slack time instead. In the future, we want to focus on reducing the number of new tickets we produce without putting substantially more effort into testing; that would just shift efforts, but not really reduce them.