You would think that incidents are a nasty thing in software development (or its lifecycle management). Honestly, over many years I have experienced the official incident management as a bigger burden than the incidents themselves! To avoid misunderstandings: this is not about the people working in Operations or someone holding the job title of Incident Manager. It is about how incidents are actually managed.
It starts with the assumptions (as Daniel Vacanti suggests, make them explicit):
Everyone watches one tool.
Incidents have a clear owner.
Incidents can be assigned to one team (or even one person).
Work on incidents can follow a clear timeline.
There can be a Service Level Agreement (SLA) for the time until an issue is resolved with 100% certainty.
Incidents are mechanical / follow a linear logic / have a clear cause and effect.
If there is a bug introduced in the software, we will see the consequences immediately.
People in the company care after the incident has been resolved.
It is no problem to admit a mistake.
The environment encourages professionalism.
If there is a root cause, it will get fixed.
People are given time for improvements.
If you have ever worked on incidents (like I did for many years), I hope by now you have had some good laughs. Of course, it is actually very sad to read this. People who design processes or buy tools with these ideas in mind are obviously far away from the real action.
What happens in reality are tricky multi-component failures, sleeper bugs that become active only every now and then, and the slow poisoning of a system that only eventually breaks (but then has to be fixed immediately). Communication is disjointed; people look at different tools, triggers, and metrics, almost living in parallel universes. They stop caring at different stages. (I admit that this sometimes makes sense. Think of an emergency at a hospital.) Once the heat is off, there is no focus on learning and sharing. Trivial causes are not taken care of. People are hesitant to own mistakes because of the blame game being played.
Not understanding that not everything is linear or timely is the biggest intellectual offense for me. Watch out for people who do not understand weaker cause-and-effect relationships! You do not want those to steer a company...
It is relatively easy to measure whether higher management actually cares about a healthy failure culture: How often are incidents a topic at high-level meetings? What learning is shared? If incidents are treated as "dirty work" that needs to get delegated away, or if learning a valuable lesson is treated as a shameful experience, then you can be sure a lot can be improved...
Not all hope is lost, though. There are enough people who know how to do a proper job and who want to do it (albeit despite, rather than supported by, their work environment and the existing processes and rules). A couple of weeks ago, I heard some wise and positive words about incidents, which inspired me to create my own variation of them: Incidents are the co-created results of previous choices. They invite us to new choices - if we keep our eyes and ears open and our brain on.
Wisdom Call: Through Fire

