Years ago, in a discussion around the need for service management, a peer of mine reminded me that most companies were trying to deal with service issues on the fly — like fixing the airplane while it was still in the air. The plane may eventually come in for a soft landing but if the team hasn’t taken the time to discover the root cause of the incidents, the danger persists — and landings may turn into crashes.
Many companies face this situation when focusing on stabilization. In today's world, system outages not only cause downtime for customers and staff but also damage the organization's reputation. Taking the time to put four major service management concepts into place helps strengthen and mature organizations and, even more importantly, reduces the impact when things go wrong. Success starts with understanding what is happening in your environments, being able to react promptly, and understanding the root cause to prevent recurrence.
Beginning with the basics always sounds simple, but many of the organizations I speak with struggle with that first step: understanding what’s going on. Just as a pilot has a checklist to verify before taking off, an up-to-date configuration management database (CMDB) is the foundation for getting a successful service management program off the ground. Although many service management programs start without this step, it is hard to measure anything without knowing your environment. Understanding the interdependencies of the systems and documenting them so all teams have quick access to the information simplifies troubleshooting and increases the capabilities for the next step.
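To make the value of documented interdependencies concrete, here is a minimal sketch in Python. It assumes a hypothetical CMDB extract in which each system lists the systems it depends on; the system names and the `impacted_by` helper are invented for illustration and do not come from any particular CMDB product.

```python
from collections import deque

# Hypothetical CMDB extract: each system maps to the systems it depends on.
DEPENDS_ON = {
    "web-portal": ["auth-service", "orders-api"],
    "orders-api": ["orders-db", "auth-service"],
    "auth-service": ["user-db"],
    "orders-db": [],
    "user-db": [],
}


def impacted_by(failed, depends_on):
    """Return every system that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a system to the systems relying on it.
    dependents = {}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(system)

    # Breadth-first walk upward through the dependency graph.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return sorted(impacted)


print(impacted_by("user-db", DEPENDS_ON))
# ['auth-service', 'orders-api', 'web-portal']
```

With the dependencies recorded in one place, a team troubleshooting a database failure can see at a glance which services are likely to be affected.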
The second step for success is monitoring. Many organizations have monitoring systems but haven’t tied that monitoring back to the CMDB. What needs to be monitored? Has a recent change affected what we should monitor, or how we should monitor it?
Just as a plane needs regular inspections to reduce failure risks, monitoring systems need consistent evaluation and care. One fundamental rule is that if an incident occurs but there was no prior alert, service management teams should evaluate what additional monitoring is needed.
A warning: Don’t create monitors or alerts if you aren’t going to define who is responsible for receiving that alert, who is the owner of the alert, and at what level of incident that alert should be generated. Many organizations make the mistake of creating millions of monitors that generate unhelpful noise, making all the monitoring systems ineffective and unusable. Only monitor for incidents you know you will respond to.
Document each monitor and whose responsibility it is to review, own, and react to each alert. Every alert owner should be required to review and approve their alerts every six months to stay current, which will increase the monitors’ and alerts’ effectiveness. Taking the time to build out solutions correctly and having alerts go directly to the team that is responsible for resolving the issue saves time and reduces impact to the organization.
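One way to keep that documentation honest is to store it as data and audit it regularly. The sketch below is a Python illustration under stated assumptions: the `Monitor` fields, the team names, and the 182-day approximation of "six months" are all hypothetical choices, not a reference to any specific monitoring tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "every six months" is approximated as 182 days.
REVIEW_INTERVAL = timedelta(days=182)


@dataclass
class Monitor:
    name: str
    owner_team: str      # team that receives and resolves the alert; "" if undefined
    severity: str        # incident level at which the alert fires
    last_reviewed: date  # date the owner last reviewed and approved the alert


def audit(monitors, today):
    """Return (unowned, overdue) lists of monitor names."""
    unowned = [m.name for m in monitors if not m.owner_team]
    overdue = [m.name for m in monitors
               if today - m.last_reviewed > REVIEW_INTERVAL]
    return unowned, overdue


# Hypothetical registry entries for illustration only.
monitors = [
    Monitor("orders-db-disk", "database-team", "high", date(2024, 1, 10)),
    Monitor("portal-latency", "", "medium", date(2024, 5, 1)),
    Monitor("auth-errors", "identity-team", "high", date(2023, 9, 15)),
]

unowned, overdue = audit(monitors, today=date(2024, 6, 1))
print("No owner defined:", unowned)   # ['portal-latency']
print("Review overdue:", overdue)     # ['auth-errors']
```

Monitors that show up in either list are candidates for cleanup: assign an owner, renew the review, or retire the alert so it stops adding noise.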
The third step for a successful flight into a mature service management program is creating a problem management process, which means finding root causes. While getting to the root causes of issues is the key to environment stability, this is one of the hardest things to accomplish because doing this takes time and effort from some of the organization’s most valuable resources.
Teams must be given, and required to spend, the time necessary to understand root causes so they can stop the next incident before it occurs. In a company where there is fire after fire, it’s problem management that takes the most effort. But it also offers the biggest payoff: determining root causes can prevent the fires from igniting in the first place.
The fourth and most essential — and often overlooked — step for a smooth service management flight is the flight plan itself. What gets mapped and measured gets focus. Define success and measure it in a format that’s easy to understand.
Each of these steps is critical to a successful takeoff. Investing in service management won’t make sense if you don’t have a clear flight path or if you and your team have no way of figuring out whether you’ve landed in the right spot. Taking the time necessary to understand your environments, implementing accurate and meaningful monitoring, and prioritizing problem management are the best ways to minimize the chances for hard landings.