- Incident management frameworks help organizations manage chaos during production outages and resolve incidents faster.
- The incident management lifecycle can be divided into 7 stages: Detect, Create, Classify, Troubleshoot, Resolve, Review, and Follow Up.
- Incident management processes require a dedicated set of folks performing different roles and responsibilities including incident managers, communications leads, on-call engineers, customer escalation managers, and executives.
- Dedicated incident management tools (Alert Management, On-call Management, Project Management, Incident Tracking, and Collaboration) enable efficient resolution of incidents and reduce toil.
- It is essential to track and measure key incident metrics, such as TTD, TTR, TTM, and Availability, to continuously evaluate the progress in the organization.
In this article, we provide an opinionated, generic framework for effective incident management, inspired by LinkedIn’s internal process, that can be tailored to fit the needs of different organizations. ITIL defines standardized processes for incident management, but the framework below differs from them and is tailored to resolving live production outages.
Most companies offer services online, and any outage results in a poor end-user experience. Repeated outages can hurt the business and its brand value. Frequent production outages are to be expected in complex distributed systems that change at high velocity. Organizations should embrace the reality of incidents and create an incident management process to facilitate faster resolution.
What are Incidents?
Incidents are unplanned production outages that significantly disrupt the end-user experience and require immediate organized intervention.
Incidents can be internal or external based on the impacted users.
- Internal Incidents - Outages that impact employee productivity due to issues within tools that are used to get their job done can be termed internal incidents (e.g., deployment tooling is not functioning for an extended duration, employees cannot log into the VPN).
- External Incidents - Outages impacting the end-user experience of a company’s products/services are termed external incidents (e.g., users cannot purchase items from an e-commerce website, and users are not able to send messages in messaging software).
The above incidents can be further divided based on severity into Minor, Medium, and Major.
- Major - Severely impacts the end-user experience for many users, and there is a clear impact on business due to revenue loss or brand value.
- Medium - Incidents impact a significant part of the service but are usually localized to a specific region, unlike major incidents.
- Minor - Incidents that impact a non-critical workflow of the service for a small percentage of users.
Consider a hypothetical example of incident severity on a social media website. The service being unavailable for most users for more than 30 minutes can be classified as a major incident. In contrast, the direct message feature not working for users in the Middle East might be a medium incident, and the verified badge not showing up on users’ profiles in Indonesia might be classified as a minor incident.
It is highly recommended to consider business goals and establish strict data-based guidelines on the incident classification to promote transparency and prevent wasting engineering bandwidth on non-critical incidents.
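To make "strict data-based guidelines" concrete, the classification rules can be written down as a small function. The following is a minimal sketch, not a prescribed policy: the `classify` function, its two inputs, and the 50% threshold are all assumptions chosen for illustration, and each organization should derive its own rules from business goals.

```python
# Hypothetical threshold: share of users above which an outage counts as widespread.
MAJOR_USER_PCT = 50.0


def classify(affected_user_pct: float, critical_workflow: bool) -> str:
    """Return 'major', 'medium', or 'minor' for an incident.

    affected_user_pct: estimated percentage of users affected (0-100).
    critical_workflow: True if a core workflow of the service is broken.
    """
    if critical_workflow and affected_user_pct >= MAJOR_USER_PCT:
        return "major"   # widespread impact on a critical workflow
    if critical_workflow:
        return "medium"  # critical workflow, but impact is localized
    return "minor"       # non-critical workflow


# The social media examples above, with made-up impact percentages.
print(classify(80.0, True))   # site unavailable for most users
print(classify(5.0, True))    # direct messages failing in one region
print(classify(3.0, False))   # verified badge missing in one country
```

Keeping the rules in one reviewable place makes classification decisions transparent and easy to revisit as business priorities change.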
What is Incident Management?
Incident management is the set of actions taken in a select order to mitigate and resolve critical incidents to restore service health as quickly as possible.
Incident Management Stages
Detect
Outages are detected proactively through monitoring and alerts set up on the infrastructure, or reactively through user reports arriving via customer support channels.
Create
Incidents are created for the detected outages, which triggers the incident management process. Ideally, an organization relies on a ticket management system similar to Atlassian’s JIRA to log incident details.
Classify
Incidents are then classified based on the established guidelines. It is highly recommended to draft these guidelines in alignment with business needs. Multiple terminologies are used across the industry today, but we will stick to the major, medium, and minor categorization to keep it simple. The incident management process and sense of urgency remain the same for all incidents, but classifying incidents helps prioritize when multiple incidents are ongoing simultaneously.
Troubleshoot
The person who initially reported the incident consults the internal on-call runbook and, to the best of their knowledge, escalates the incident to the on-call engineers of the respective service. Escalations continue until the root cause of the issue is identified; sometimes, an incident may involve multiple teams working together to find the problem.
Resolve
As their highest priority, the teams involved focus on identifying the steps to mitigate the ongoing incident in the shortest time possible. The key is to take intelligent risks and act decisively. Once the issue is mitigated, teams focus on resolving the root cause to prevent the problem from recurring. Throughout the resolution process, communication with internal and external stakeholders is essential.
Review
The incident review usually takes place after the root cause has been identified. The team involved in the incident and critical stakeholders get together to review it in detail. Their goal is to identify what went wrong, what could be improved to prevent or resolve similar issues faster in the future, and which short- and long-term action items will improve the process and the stack.
Follow Up
Incident action items are reviewed regularly at the management level to ensure that all of them are resolved. Critical metrics around incidents, such as TTD (Time To Detect), TTM (Time To Mitigate), TTR (Time To Resolution), and SLAs (Service Level Agreements), are evaluated to determine incident management effectiveness and to identify strategic investment areas that improve the reliability of the services.
Incident Management Roles & Responsibilities
A dedicated set of folks trained to perform specific roles during the incident is essential to successfully manage production incidents with minimum chaos. Ideally, people assume one function as the responsibilities are substantial and require particular skills. Roles can be merged and customized to fit the business needs and the severity of the incidents.
The Incident Manager, referred to as IM for brevity in this document, is the person in charge of the incident, responsible for leading it to resolution with the proper sense of urgency. During an incident, one person should be responsible for the general organization of the incident management process, including communication and decisions. This person is empowered to make decisions and to ensure incidents are handled efficiently according to strategy.
The Incident Manager is responsible for four main aspects of incident management: organization, communication, decision management, and post-incident follow-up.
- The organization of an incident is paramount to efficient resolution.
- The IM will be responsible for pulling in the correct teams and stakeholders to ensure a quick resolution.
- The IM will work with stakeholders to ensure that work items raised during investigation and remediation are assigned and tracked.
- During an incident, many decisions need to be made.
- The IM is responsible for identifying inflection points between investigation and quick resolution and ensuring that decisions are made promptly and appropriate stakeholders are engaged/aware.
- The IM is empowered to judge who owns decisions when consensus cannot be reached during troubleshooting.
- After an incident, the IM is the communications point of contact for the incident.
- As the IM was actively involved in the incident, they are responsible for leading the post-mortem in collaboration with service owners and stakeholders.
- The IM will collaborate with service owners and present the incident overview and essential action items from the post-mortem to higher management.
During an active incident, on-call engineers from impacted services and owning services are engaged to investigate and mitigate the issues responsible for the incident.
On-call engineers from affected services are responsible for evaluating the customer impact and service impact and validating the mitigation/resolution steps before giving the all-clear signal to close the incident.
Owning on-call engineers accountable for the service causing the outage/issues are responsible for actively investigating the root cause and taking remediation steps to mitigate/resolve the incident.
Effective communication between stakeholders, customers, and management is critical in quickly resolving incidents. Dissemination of information to stakeholders, management, and even executives avoids the accidental compounding of incidents, helps manage chaos, prevents duplicate/siloed efforts across the organization, and improves time to resolution.
The Communications Manager is responsible for all written communications about the incident to the various internal and external stakeholders (employee and executive updates, social media updates, and status pages).
Customer Escalation Manager
In large companies that cater to a wide variety of enterprise customers with strict SLA requirements, it is common to have dedicated Customer Escalation Managers to bridge the communication between the customers and internal incident teams.
- Stay in contact with customers, collect details about ongoing incidents and relay the information to the internal team debugging the issue.
- Distill communication updates from the Communications Manager and regularly pass customized updates to customers.
- Identify workarounds that customers can apply until the full resolution of the issue is in place.
Executives responsible for the services causing the customer impact are constantly updated on the incident status and customer impact details. Executives also play a crucial role in making decisions about the incident that may impact the business, routing resources to speed up the incident resolution process.
Incident Management Tools
Many tools are required at each stage of the incident management lifecycle to mitigate issues faster. Large companies roll out custom-built tools that interoperate well with the rest of the ecosystem. In contrast, many tools are available in the market for organizations that don’t need to build custom tools, either open-sourced or commercial. This section will review a few standard categories of essential tools for the incident management process.
Alert management helps set up alerts and monitor anomalies in time series metrics over a certain period. It sends notifications to on-call personnel to inform them of abnormalities detected in the operational metrics. Alert management tools can be configured to escalate alerts to on-call engineers via multiple mediums: a pager or phone call for critical alerts, and messages or email for non-critical ones.
Alert management tools should support different mediums and be able to interoperate with observability tools such as Prometheus, Datadog, New Relic, Splunk, and Chronosphere. Grafana Alert Manager is an open-sourced alert management tool; PagerDuty, Opsgenie, and FireHydrant are some of the commercial alert management tools available in the market.
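The two behaviors described above, detecting a sustained anomaly in a metric and routing the alert by criticality, can be sketched in a few lines. This is an illustrative sketch only: `should_fire`, `channels_for`, the `Alert` type, and the channel names are hypothetical and do not correspond to the API of any of the tools mentioned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    summary: str
    critical: bool


def should_fire(samples: list[float], threshold: float, window: int) -> bool:
    """Fire only if the last `window` samples all exceed the threshold.

    Requiring a sustained breach avoids paging on a single noisy data point.
    """
    recent = samples[-window:]
    return len(recent) == window and all(value > threshold for value in recent)


def channels_for(alert: Alert) -> list[str]:
    """Critical alerts page a human; non-critical ones go to async channels."""
    if alert.critical:
        return ["pager", "phone"]
    return ["chat", "email"]


error_rates = [0.01, 0.02, 0.35, 0.40, 0.42]
if should_fire(error_rates, threshold=0.30, window=3):
    alert = Alert("checkout", "error rate above 30% for 3 samples", critical=True)
    print(channels_for(alert))
```

Real alerting tools offer far richer rule languages, but the same two decisions (when to fire, and whom to notify through which medium) sit at their core.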
In a large organization with thousands of engineers and microservices, engaging the correct person in a reasonable amount of time is crucial for resolving incidents faster. On-call management tools help share on-call responsibilities across teams through scheduling and escalation features, and they map services to their on-call engineers to enable seamless collaboration during large-scale critical incidents.
On-call management tools should support customizations in scheduling and service ownership details. PagerDuty and Splunk On-Call are some of the most well-known commercial options, whereas LinkedIn’s OnCall tool is an open-sourced alternative for organizations looking for budget options.
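At its simplest, the service-to-on-call mapping with escalation is a lookup into an ordered chain of responders. The sketch below assumes a hard-coded `ESCALATION_CHAINS` table, made-up service and responder names, and a fallback to a central site operations team; production tools derive all of this from schedules rather than a static dictionary.

```python
# Hypothetical data: each service maps to an ordered escalation chain.
ESCALATION_CHAINS = {
    "payments": ["payments-primary", "payments-secondary", "payments-manager"],
    "messaging": ["messaging-primary", "messaging-secondary"],
}

# Hypothetical fallback when a service has no registered owner.
FALLBACK_RESPONDER = "site-operations"


def next_responder(service: str, unacknowledged_pages: int) -> str:
    """Return who to page next for a service.

    unacknowledged_pages: how many earlier pages for this incident went
    unanswered. Each unanswered page moves one step up the chain; the last
    entry keeps getting paged once the chain is exhausted.
    """
    chain = ESCALATION_CHAINS.get(service)
    if not chain:
        return FALLBACK_RESPONDER
    index = min(unacknowledged_pages, len(chain) - 1)
    return chain[index]


print(next_responder("payments", 0))
print(next_responder("payments", 1))
print(next_responder("unknown-service", 0))
```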
It is not uncommon to have hundreds of employees engaged during critical incidents. Collaboration and communication are essential to manage chaos and effectively resolve incidents. These days, every software company has messaging or video conferencing software that engineers can readily use to hop on a bridge and collaborate. Easy and fast access to information on which groups in messaging applications to join or which bridge to participate in the video conferencing software is critical in reducing the time to resolve incidents.
A separate channel for every incident discussion is vital to enable easier collaboration. Bridge links are usually pinned to the group chat’s description for new engineers to join the meeting. A well-established process reduces the noise of logistical questions such as "where should I join" or "can someone please share the bridge link" in the group chat and keeps the communication channel clear for troubleshooting.
Incidents generate vast amounts of critical data via automated processes or manual scribing of the data for future reference. Classic note-taking applications won’t go too far due to a lack of structure. A ticketing platform that supports multiple custom fields and collaboration abilities is a good fit. An API interface to fetch historical incident data is crucial.
Atlassian’s JIRA is used by many companies for incident tracking, but similar tools such as Notion, Airtable, and Coda work equally well. Bugzilla is an open-sourced alternative that can help with incident tracking.
Knowledge-sharing tools are essential for engineers to find the correct information with ease. Runbooks, service information, post-mortem documents, and to-dos all belong in knowledge-sharing applications. Google Docs, wikis, and Notion are all good options for capturing and sharing knowledge within the organization.
Status pages are a medium to easily broadcast the current status of the service health to outside stakeholders. Interested parties can subscribe to the updates to know more about the incident's progress. Status pages reduce inbound requests to customer service departments regarding the system's health when an external incident occurs.
Incident Response Lifecycle
In the last sections, we discussed different stages, roles, and tools in incident management. This section will use the above information and detail the incident response process stages.
Issues are detected by internal monitoring systems or by user reports via customer support or social media. It is not uncommon for internal employees to see the issue first and escalate it to the centralized site operations team. Organizations should adopt reasonable observability solutions to detect problems faster so that Time To Detect (TTD) metrics are as small as possible.
In case of user escalations, a process should be implemented for employees to quickly escalate the issues to the relevant teams using the available on-call management tools. Escalation of issues marks the beginning of the incident management lifecycle.
The team collects the required information about the incidents and creates an incident tracking ticket. Additional details about affected products, start time, impacted users, and other information that may help engineers troubleshoot should also be captured.
Once the ticket is created, the on-call Incident Manager needs to be engaged using the internal incident management tool. A shared channel for communications in the internal messaging service and a video bridge for easy collaboration should be started.
The Incident Manager works with the team to identify the on-call engineers for impacted services and collaborates with them to better understand the user impact. Based on the impact, the Incident Manager classifies the incident into major, medium, or minor. Major incidents are critical and would typically be an all-hands-on-deck situation.
Once the issue is classified as major, a preliminary incident communication is sent to all relevant stakeholders, stating that a major incident has been declared and noting the available information about it. This initial communication lacks details but should provide sufficient context for recipients to make sense of the outage. The external status page should be updated to acknowledge that there is an ongoing issue and that the organization is working on resolving it.
The Incident Manager should escalate the issue and engage all relevant on-call engineers based on the best available information. The communications lead will take care of the communications, and the customer escalation manager should keep the customers updated with any new information. The incident tracking ticket should capture all necessary incident tracking data.
If more teams are required, the Incident Manager should engage the respective teams until all the people needed to resolve the incident are present.
Teams should focus first on mitigating the incident and pursue the root cause and full resolution afterward. For example, if an outage is confined to one region, teams can explore redirecting all traffic from the affected region to healthy regions to mitigate the issue. Mitigating the incident by any temporary means helps reduce the TTM (Time To Mitigate) and provides much-needed space for engineers to fix the root cause.
Throughout the troubleshooting process, detailed notes are maintained on things identified that may need to be fixed later, problems encountered during debugging, and process inefficiencies. Once the issue is resolved, the temporary mitigation steps are removed, and the system is brought to its healthy state.
Communications are updated with the issue identified, details on steps taken to resolve the problems, and possible next steps. Customers are then updated on the resolution.
Once the root cause is identified, a detailed incident document is written with all the details captured during the incident. All stakeholders and the team participating in the incident management get together and conduct a blameless post-mortem. This review session aims to reflect on the incident and identify any technology or process opportunities to help mitigate issues sooner and prevent a repeat of similar incidents. The timeline of the incidents needs to be adequately reviewed to uncover any inefficiencies in the detection or incident management process.
All the necessary action items are identified and assigned to the respective owners with the correct priorities. The immediate high-priority action items should be addressed as soon as possible, and the remaining lower-priority items must have a due date. A designated person can help track these action items and ensure their completion by holding teams accountable.
Metrics to Measure
As it is said in SRE circles: "what gets measured gets fixed." The following are standard metrics that should be measured and tracked across all incidents and organizations.
Time To Detect (TTD)
Time To Detect is the time from the start of an outage until it is detected, either manually or via automated alerts. Teams can adopt more comprehensive alert coverage with fresher signals to detect outages faster.
Time To Mitigate (TTM)
Time To Mitigate is the time taken to mitigate the user impact from the start of the incident. Mitigation steps are temporary solutions until the root cause of the issue is addressed. Striving for better TTM helps increase the availability of the service. Many companies rely on serving users from multiple regions in an active-active mode and redirecting traffic to healthy regions to mitigate incidents faster. Similarly, redundancy at the service or node level helps mitigate faster in some situations.
Time To Resolution (TTR)
Time to Resolution is the time taken to fully resolve the incident from the start of the incident. Time to Resolution helps better understand the organization’s ability to detect and fix root causes. As troubleshooting makes up a significant part of the resolution lifecycle, teams can adopt sophisticated observability tools to help engineers uncover root causes faster.
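All three metrics are measured from the start of the incident, so they reduce to simple timestamp subtraction once the four key moments are recorded on the incident ticket. The sketch below assumes a hypothetical `IncidentTimeline` record with made-up timestamps; the field names are illustrative and not tied to any ticketing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    started_at: datetime    # when user impact began
    detected_at: datetime   # when an alert fired or a report came in
    mitigated_at: datetime  # when user impact stopped
    resolved_at: datetime   # when the root cause was fixed

    @property
    def ttd(self) -> timedelta:
        """Time To Detect: start of the outage to detection."""
        return self.detected_at - self.started_at

    @property
    def ttm(self) -> timedelta:
        """Time To Mitigate: start of the incident to mitigation."""
        return self.mitigated_at - self.started_at

    @property
    def ttr(self) -> timedelta:
        """Time To Resolution: start of the incident to full resolution."""
        return self.resolved_at - self.started_at


incident = IncidentTimeline(
    started_at=datetime(2024, 1, 1, 10, 0),
    detected_at=datetime(2024, 1, 1, 10, 5),
    mitigated_at=datetime(2024, 1, 1, 10, 35),
    resolved_at=datetime(2024, 1, 1, 12, 0),
)
print(incident.ttd, incident.ttm, incident.ttr)
```

Capturing these timestamps consistently on every incident is what makes the metrics comparable across teams and over time.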
Key Incident Metadata
Incident metadata includes the number of incidents, root cause type, services impacted, root cause service, and detection method. It helps the organization determine the TBF (Time Between Failures); the goal is to increase the Mean Time Between Failures. Analyzing this metadata also helps identify hot spots in the operational aspects of the organization.
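As a rough illustration, the Mean Time Between Failures can be estimated from incident start times alone. The sketch below makes a simplifying assumption: it averages the gaps between consecutive incident start times and ignores how long each incident lasted. The function name and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def mean_time_between_failures(starts: list[datetime]) -> Optional[timedelta]:
    """Average gap between consecutive incident start times.

    Returns None when fewer than two incidents are recorded, because no gap
    can be measured.
    """
    if len(starts) < 2:
        return None
    ordered = sorted(starts)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


incident_starts = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 11),
    datetime(2024, 1, 31),
]
print(mean_time_between_failures(incident_starts))
```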
Availability of Services
Service availability is the percentage of time a service is up over a given period. The availability metric is used as a quantitative measure of resiliency.
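The calculation itself is uptime divided by the length of the measurement period. The sketch below uses a hypothetical `availability_pct` helper and made-up numbers to show how a small amount of downtime translates into an availability figure.

```python
from datetime import timedelta


def availability_pct(period: timedelta, downtime: timedelta) -> float:
    """Percentage of the period during which the service was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must be between zero and the period")
    uptime = period - downtime
    return 100.0 * (uptime / period)


# 43.2 minutes of downtime in a 30-day period.
month = timedelta(days=30)
outage = timedelta(minutes=43.2)
print(f"{availability_pct(month, outage):.3f}%")
```

Under these assumptions, roughly 43 minutes of downtime in a 30-day month corresponds to about 99.9% availability, which shows how little room for outages a high availability target leaves.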
This article discussed the incident management process and showed how it can help organizations manage chaos and resolve incidents faster. Incident management frameworks come in various flavors, but the ideas presented here are generic enough to customize and adapt in organizations of any size.
Organizations planning to introduce the incident management framework can start small by collecting the data around incidents. This data will help understand the inefficiencies in the current system or lack thereof and provide comparative data to measure the progress of the new incident management process about to be introduced. Once they have a better sense of the requirements, they can start with a basic framework that suits the organization's size without creating additional overhead. As needed, they can introduce other steps or tools into the process.
If you are looking for additional information on improving and scaling the incident management process, the following are great places to start:
- Anatomy of an Incident - Ayelet Sachto, Adrienne Walcer
- Incident Management for Operations - Rob Schnepp, Ron Vidal, Chris Hawley
- Atlassian's Incident Management Handbook
- SREcon21 - Evolution of Incident Management at Slack
Organizations looking to improve their current incident management process must take a deliberate test, measure, tweak, and repeat approach. The focus should be on identifying what’s broken in the current process, making incremental changes, and measuring the progress. Start small and build from there.
About the Author
Anil Kumar Ravindra Mallapur