How I learned to stop worrying about incidents and love on-call
In early 2014, Atlassian started the process of rebuilding our core products as modern “cloud native” applications. This was a big change that required careful planning and execution across most of the company. One complicated issue was figuring out how to streamline user management across our SaaS products.
Eventually we were ready. As we rolled the new SaaS change out to all customers, things looked good at first… but then an “interesting” support case came in. A customer reported that their users were able to see content that they shouldn’t have had access to. Upon investigation, we realized that in some edge cases the rollout had reconfigured customers’ groups in a way that removed some existing access controls. This allowed some of those customers’ end-users to access content that was supposed to be off-limits to them.
We recognized immediately that this was a very serious problem. Although it affected a very small number of customers, its nature made it our highest priority. It was a complex issue that required weeks of work from many teams, entailed several all-nighters, and pulled in staff from engineering to public relations and everyone in between. Existing plans were trashed, vacations cancelled, and meeting rooms cordoned off as “war rooms” for weeks on end.
Once the incident was resolved and the dust had settled, we started asking questions: What went wrong that led to this? How could we have prevented it, or improved our response? Why haven’t we learned from similar incidents in the past? The answers to these questions led us to codify the Atlassian incident and postmortem processes that we are publishing as a book.
Many organizations discover the need for incident and postmortem processes in a similar way: a major incident goes badly, and they resolve to never let it happen again. In this situation, it’s tempting to design an incident process with the goal of completely eliminating incidents. After all, incidents are things to be avoided, right? If we had a choice, we wouldn’t have them at all, so why shouldn’t we try to prevent them all from happening?
In an attempt to eliminate incidents, organizations often add safety gates, checkpoints, and other protective measures to their software development process. Change review boards and weekly release meetings are typical examples: the intention is to scrutinize every change so carefully that no bugs slip through. This is an understandable reaction to incidents, but it’s not the right one.
Change gates and checkpoints slow down the organization’s rate of change. That does tend to reduce the rate of incidents in the short term, but more importantly, it reduces innovation momentum as well. Backlogged changes pile up behind the gates, so they ship less frequently and in bigger batches, which makes each release more (not less!) risky. The net effect is that the company can no longer make changes as fast as before, there is more overhead, and incidents still happen. The pendulum has swung from unawareness to overcaution. Frustration ensues en masse.
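To make the batching argument concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every change independently has the same small chance of causing an incident; the 1% figure and the function name are hypothetical, not numbers or tooling from Atlassian. Under that simplified model, the chance that a release contains at least one bad change grows quickly with the number of changes bundled into it.

```python
def batch_failure_probability(per_change_failure_rate: float, batch_size: int) -> float:
    """Chance that a release of `batch_size` changes contains at least one bad change.

    Simplifying assumption: changes fail independently, each with the same
    probability. Real changes are rarely this tidy.
    """
    if not 0.0 <= per_change_failure_rate <= 1.0:
        raise ValueError("per_change_failure_rate must be between 0 and 1")
    if batch_size < 0:
        raise ValueError("batch_size must not be negative")
    # P(at least one failure) = 1 - P(no change in the batch fails)
    return 1.0 - (1.0 - per_change_failure_rate) ** batch_size


if __name__ == "__main__":
    # Hypothetical 1% per-change failure rate, chosen only for illustration.
    for size in (1, 5, 20, 50):
        risk = batch_failure_probability(0.01, size)
        print(f"{size:>2} changes per release: {risk:.1%} chance of a bad release")
```

The model leaves out a second cost of big batches: when a large release does go wrong, there are many more changes to sift through to find the cause, so recovery tends to take longer as well.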
The desire to reduce the impact of incidents is correct, but aiming for zero incidents is a mistake. The only way to completely prevent incidents is to stop all changes. But organizations need to make changes in order to compete and survive, and all change entails risk. Thus, we aim to reduce risk in ways that let us continue to pursue progress. It’s the same reason we invented air bags and smoke detectors rather than stop making cars and buildings.
The post How I learned to stop worrying about incidents and love on-call appeared first on SD Times.