How to build a modern and effective incident management process

The emergence of new technologies, such as mobile devices in the professional world, IoT, or the digitization of businesses, has revolutionized how people interact with information and technology.

This evolution has increased the need for CIOs and CTOs to ensure that the services they provide to their customers remain undisrupted. This necessity has accelerated the adoption of IT service management best practices, such as ITIL and VeriSM, which help them put in place robust processes to manage their services efficiently and provide their internal and external customers with a seamless experience.

Incident management is one of the most critical core IT support processes. It helps IT organizations restore services efficiently after an outage and ensure business continuity.

Below are a few tips for building effective incident management within IT organizations.

Categorization and prioritization

The impact on production differs depending on whether an issue affects one user or several. Thus, it is essential to establish, from the very beginning, criteria by which support teams can quickly identify and categorize an issue and determine whether it is an isolated or a major incident.
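As an illustration, such criteria can be written down as explicit rules rather than left to individual judgment. The sketch below is a minimal example and not a prescribed ITIL implementation: the 50-user threshold, the three-level scale, and the P1 to P5 labels are all assumptions chosen for the example and should be replaced with values agreed within your organization.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical threshold: at or above this many affected users,
# the incident is treated as major.
MAJOR_INCIDENT_USER_THRESHOLD = 50


def impact_from_users(affected_users: int) -> Level:
    """Derive impact from the number of affected users (illustrative cut-offs)."""
    if affected_users >= MAJOR_INCIDENT_USER_THRESHOLD:
        return Level.HIGH
    if affected_users > 1:
        return Level.MEDIUM
    return Level.LOW


def priority(impact: Level, urgency: Level) -> str:
    """Combine impact and urgency into a label from P1 (highest) to P5 (lowest)."""
    # impact + urgency ranges from 2 to 6; a higher sum means a higher priority.
    return f"P{7 - (impact + urgency)}"


def is_major(affected_users: int, business_critical: bool) -> bool:
    """Major when a business-critical service is hit or many users are affected."""
    return business_critical or affected_users >= MAJOR_INCIDENT_USER_THRESHOLD
```

For example, an issue reported by a single user with low urgency comes out as P5 and isolated, while the same symptom on a business-critical service is flagged as major regardless of the user count.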

IT events are regularly misclassified, which prevents IT organizations from handling them efficiently and can have a substantial business impact. Hence, to avoid any confusion, IT managers should take several criteria into consideration while defining their incident management process.

There is no doubt that the ITIL essentials, such as urgency, impact, and severity, have to be defined. For a long time, these processes were jeopardized by misinterpretation or human error. Today, organizations can leverage automation to implement dynamic workflows and re-establish a disrupted service almost instantly through scripted processes. Below is a list of processes that can be fully automated for seamless incident management:

Incident identification: thanks to discovery and monitoring/hyper-monitoring capabilities, incidents can be easily identified by systems that generate appropriate alerts and even automatically create the related tickets in ITSM tools, for example through SSRS alerts.
Communication process: depending on the severity of an incident, different alert systems can be defined to simultaneously inform impacted users, the technical teams in charge of resolving the issue, and the governance team for further incident analysis and reporting.
Assigning: this is a critical step in incident management. I have personally suffered from this issue while managing support teams in previous roles, as assigning an incident to the wrong team results in breached SLAs, production loss, and user dissatisfaction. Thanks to workflow automation and artificial intelligence, this aspect can be controlled: based on its criteria and description, an incident can be assigned to the right team according to its typology and priority in 99.9% of cases.
Tracking: this step is neglected most of the time. While it is essential to focus on incident analysis and lessons learned, IT managers must not underestimate the importance of tracking incident resolution. A regular alert system based on changes in status and escalation level can help incident managers track outages efficiently and control the communication and remediation process.
Reporting: this process used to be very time-consuming for IT managers. Thanks to the tools mentioned above, a multi-channel report can be scheduled to generate automatically upon incident closure, enabling incident managers to analyze all aspects of an outage with an accurate chronology. This helps them, on the one hand, put in place the right preventive action plan to avoid such an issue in the future and, on the other hand, communicate an accurate report to the governance team within the expected deadlines.
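To make the assigning step above more concrete, here is a minimal sketch of automated routing. It is a simple keyword-based stand-in, not an AI classifier and not the API of any particular ITSM tool: the team names, the keyword rules, and the rule that P1 incidents also notify incident managers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    description: str
    priority: str  # e.g. "P1" (highest) to "P5" (lowest)


# Hypothetical keyword-to-team rules; the first matching rule wins.
ROUTING_RULES = [
    (("vpn", "network", "dns"), "network-team"),
    (("database", "sql"), "database-team"),
    (("password", "login", "account"), "service-desk"),
]
DEFAULT_TEAM = "service-desk"
ESCALATION_TEAM = "incident-managers"


def assign(incident: Incident) -> list[str]:
    """Return the teams to notify, with the resolving team first."""
    text = incident.description.lower()
    team = DEFAULT_TEAM
    for keywords, candidate in ROUTING_RULES:
        if any(keyword in text for keyword in keywords):
            team = candidate
            break
    teams = [team]
    # Assumed policy: the highest-priority incidents also alert incident managers.
    if incident.priority == "P1":
        teams.append(ESCALATION_TEAM)
    return teams
```

In practice, the keyword table would likely be replaced by the classification features of the ITSM tool in use. The point of the sketch is only that the routing decision becomes explicit, testable, and independent of whoever happens to read the ticket first.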

Resource management

Ensure that roles and responsibilities are clearly defined and that the right resources are assigned to each incident, depending on its severity and typology. Determining roles and responsibilities early in the incident management process improves how such issues are handled and reduces the impact IT events might have on the business. An IT service manager’s main target must be to keep resources engaged and to build a strategy that spares the organization conflicts caused by lack of time, resource availability, or missing competencies. Depending on the severity of an incident, three types of organization might be engaged in the incident management process.

Low and medium impact incidents: these incidents mainly concern business-as-usual issues encountered by users or occurring on IT equipment. They have traditionally been handled by L1 or L2 teams (most of the time organized as a service desk) following documented procedures. Nowadays, they can be handled directly by self-help services made available to users through support portals or chatbots, for example. Incident resolution can also be made transparent to users through automatic self-healing capabilities or predictive maintenance. Even though automation, which treats such events at the back-end level, can make dedicated teams for these incidents seem obsolete, IT managers must still describe and detail these actions within their process.
Critical incidents: these incidents are mainly resolved by teams of experts and service owners who can perform advanced troubleshooting and resolve unexpected errors. Mapping where knowledge sits within the organization reduces incident resolution time and keeps companies from losing valuable time looking for the right person to handle an outage.
Major outages: these incidents do not happen regularly. However, it is crucial that IT service managers consider them while designing their services. Preparing IT organizations to face such events, by determining in advance the people who will be part of the task force and empowered to make decisions, is key to avoiding huge losses and major disruptions.
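The three tiers above can be recorded as a simple lookup so that nobody has to work out who responds while an outage is in progress. This is only a sketch under assumed names: the severity labels and responder groups are placeholders for whatever your own process defines.

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    CRITICAL = "critical"
    MAJOR = "major"


# Hypothetical responder groups for the three tiers described above.
RESPONDERS = {
    Severity.LOW: ["self-service portal", "service desk (L1/L2)"],
    Severity.MEDIUM: ["self-service portal", "service desk (L1/L2)"],
    Severity.CRITICAL: ["subject-matter experts", "service owner"],
    Severity.MAJOR: ["major-incident task force", "decision makers"],
}


def responders_for(severity: Severity) -> list[str]:
    """Return a copy of the responder list for the given severity."""
    return list(RESPONDERS[severity])


def undefined_severities() -> list[Severity]:
    """List severities that have no responders defined, to catch process gaps."""
    return [s for s in Severity if not RESPONDERS.get(s)]
```

Running a check such as undefined_severities() whenever the process document changes is one way to verify that every severity level, including the rare major outage, has named responders before it is needed.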

The incident is resolved, great! Yet the future must be secured

Resolving an incident does not mean that the issue is over. Every successful IT manager must take the post-incident period to heart. Performing a root cause analysis enables organizations to understand the reasons that led to an outage and then implement organization-wide changes and strategies to prevent similar incidents in the future.

Knowledge management

One of the everlasting problems that organizations face is knowledge management. Throughout my career, I have faced, or met people who have faced, a situation where an outage had to be managed on a service nobody knew anything about because the person who developed it had left the company. Defining an editorial template for the knowledge base that captures service architecture, critical details, bug resolutions, and the most frequently encountered issues with their resolutions can be lifesaving in critical situations. Hence, developing a knowledge-sharing culture, tracking problems, and regularly updating and validating documentation are as critical as the incident management process itself.
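One possible shape for such a template is sketched below as a small data structure. The field names and the 180-day validation interval are assumptions made for illustration and are not taken from any specific knowledge management framework.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class KnowledgeArticle:
    service: str
    architecture_summary: str
    critical_details: str
    # Maps a frequently encountered issue to its known resolution.
    known_issues: dict[str, str] = field(default_factory=dict)
    owner: str = ""
    last_validated: date = field(default_factory=date.today)

    def needs_review(self, today: date, max_age_days: int = 180) -> bool:
        """True when the article has not been validated within max_age_days."""
        return today - self.last_validated > timedelta(days=max_age_days)
```

A short usage example, with invented values:

```python
article = KnowledgeArticle(
    service="billing-api",
    architecture_summary="Stateless API in front of a relational database",
    critical_details="Nightly batch job must finish before invoices are sent",
    known_issues={"Request timeouts under load": "Restart the worker pool"},
    owner="billing-team",
    last_validated=date(2024, 1, 1),
)
stale = article.needs_review(today=date(2024, 12, 1))
```

Keeping an owner and a validation date on every article is what makes regular documentation review enforceable rather than aspirational.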

The KCS (Knowledge-Centered Service) framework, originally developed for customer support organizations, is one of the most efficient knowledge management frameworks I have come across, and I recommend it to my peers (I have no affiliation with KCS organizations).

To Conclude

Incidents and outages are unavoidable. While organizations must put significant effort into avoiding incidents, it is also crucial for them to be prepared for every eventuality and to make every issue they encounter an opportunity to learn and to grow service maturity. Mastering these tips could be an initial step towards building a bulletproof IT incident management process.
