Root Cause of Success
Like most companies, we do root cause analysis when things go wrong. “Root cause” is a bit of a misnomer: we deal with complex systems, usually with different levels of redundancy, so a single root cause is rarely realistic; really they are more like post mortems. In any case, when we have an incident, it’s important to review what went wrong: gathering logs, graphs, and other data to learn why the assumptions we made did not play out as we thought, and to determine what changes we might need to make for the future. This cycle of review and learning is critical for continued success.
This past weekend, the OmniTI operations folks went through a number of significant production excursions, most of which were pulled off with good success. Afterward, we didn’t do a post mortem. This probably isn’t too different from most shops; I think most people don’t do a post mortem when things work. We probably should. Even when things work, there are usually surprises along the way, and if you only do an in-depth look back when things fail, you’re probably overlooking use cases and scenarios you are likely to encounter again. Additionally, it’s good information for people to be able to review, especially when bringing on new hires. You might think this would be boring, but I happen to love reading a well-written post mortem. You probably do too; you just don’t think of something like Apollo 13 as a giant post mortem, but for the most part that’s what it is.
So I’m curious: are there shops where people do regular detailed accounting when things go right? Not just having audit trail information around, but walking through those logs as a group and talking out loud about the areas that were more hope than plan, but that everyone feels confident in since they worked. I know a lot of different people running web operations, but this doesn’t seem like a common practice; if you’ve worked in such an environment, I’d love to hear about your experiences.