IT incidents happen to anyone doing business in the digital world. Even the simplest of sites or platforms is actually quite complex. There are so many lines of code (even if you use drag-and-drop development, code is still there, even when you can’t see it) and so many features that can cause a problem at any time.
Every time something changes, it can affect other parts of the system, which is why there is so much potential for a problem. Also, with more people participating, there’s even more room for things to go wrong.
You can’t avoid incidents; you can just prepare a better response. With that in mind, here are the top five practices for better IT incident management.
1. Understanding incident levels
The biggest problem with understanding incidents is that the very word incident is a broad term. It could mean a number of things, and the severity of an incident can vary from minor to serious to very, very serious.
To fully understand incident management, it helps to look at examples of different incidents and the activities used to manage them.
For instance, if someone in your office forgets their password, they either need to remember or reset it. This is a so-called level-one incident. If they don’t know how to do it, an incident management activity would be one of the members of your IT team getting them on the line and explaining it to them.
A glitch in the software is a bigger problem and a level-two incident. In this scenario, you need to fix the glitch, but unlike the password issue, this is not a problem everyone can resolve. You need someone with the experience and specific skills to work on this software. This is problematic because fewer people can solve the problem, which causes a potential bottleneck.
Lastly, there’s the level-three incident, where the whole server or your primary platform goes down. No one can log in, no one can work, and your customers can’t access your services. The severity of this problem makes it a top priority, and when this happens, all available hands should be focused on resolving it.
The core of your incident response plan needs to consist of three things:
- Noticing an incident
- Identifying an incident level and origin
- Initiating an appropriate incident management activity
Naturally, each of these steps has to be executed flawlessly, quickly, and accurately.
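As a rough illustration, the three steps above (notice, identify level and origin, initiate an activity) can be sketched in a few lines of Python. The class names, the two boolean flags, and the wording of the activities are assumptions made for this example; they simply mirror the password, glitch, and outage examples from this section.

```python
from dataclasses import dataclass
from enum import IntEnum


class IncidentLevel(IntEnum):
    LEVEL_ONE = 1    # e.g. a forgotten password
    LEVEL_TWO = 2    # e.g. a software glitch that needs a specialist
    LEVEL_THREE = 3  # e.g. the server or primary platform is down


@dataclass
class Incident:
    """A noticed incident. All fields are hypothetical, for illustration only."""
    description: str
    origin: str                     # where the problem was traced to
    platform_down: bool = False     # nobody can log in or work
    needs_specialist: bool = False  # only someone with specific skills can fix it


def identify_level(incident: Incident) -> IncidentLevel:
    """Pick the most severe level that matches the incident."""
    if incident.platform_down:
        return IncidentLevel.LEVEL_THREE
    if incident.needs_specialist:
        return IncidentLevel.LEVEL_TWO
    return IncidentLevel.LEVEL_ONE


# Assumed mapping from level to incident management activity.
ACTIVITIES = {
    IncidentLevel.LEVEL_ONE: "IT support walks the user through the fix",
    IncidentLevel.LEVEL_TWO: "assign to a specialist who knows the affected software",
    IncidentLevel.LEVEL_THREE: "top priority, all available hands restore the service",
}


def respond(incident: Incident) -> str:
    """Identify the level and return the activity to initiate."""
    level = identify_level(incident)
    return f"[level {level.value}] {incident.origin}: {ACTIVITIES[level]}"


if __name__ == "__main__":
    print(respond(Incident("forgot password", "office workstation")))
    print(respond(Incident("nobody can log in", "primary server", platform_down=True)))
```

In practice, the rules inside `identify_level` would come from your own incident response plan rather than from two flags.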
2. Making a good incident response plan
The first thing you need to do is state the clear objectives of your incident response plan. In general, the objective is always to resolve the problem and resume regular service; however, every field and industry has types of incidents that are more or less likely to occur.
In web development, we’re usually talking about data breaches, DDoS attacks, loss of data, etc. Each of these incidents has a clear resolution. A DDoS attack is over when your services are back up and operational. Loss of data is resolved when you recover your data (ideally from a backup).
The simplest way to explain the importance of clear objectives is to say that, in order to do anything successfully, you need to know what you’re trying to do. Without an objective, it would be like attempting to do carpentry by just nailing some boards to each other without a specific piece of furniture in mind.
Another thing you need to do is identify key stakeholders. Whose job is it anyway? In the previous section, we mentioned incident levels. Responsibility is usually a question of skill, since not everyone can resolve each of these problems. If it’s a level-one incident, almost anyone in your IT department can solve it. When it’s level two or three, there’s probably only a handful of people who can fix the problem.
Still, you need to specify who.
When it comes to incident response, your ability to react quickly makes all the difference. For this to work, the first person who notices the problem needs to know who to contact. If an automated platform raises the alert, it needs to notify one specific person.
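One simple way to make "who to contact" unambiguous is a lookup table with exactly one named contact per incident level. The sketch below assumes three levels and uses made-up roles and placeholder addresses; none of them come from a real roster.

```python
from typing import NamedTuple


class Contact(NamedTuple):
    name: str
    channel: str


# Hypothetical roster: one specific person (or role) per incident level.
ON_CALL = {
    1: Contact("help desk lead", "helpdesk@example.com"),
    2: Contact("senior developer on call", "dev-oncall@example.com"),
    3: Contact("incident commander", "+1-555-0100"),
}


def who_to_contact(level: int) -> Contact:
    """Return the single contact responsible for the given incident level."""
    try:
        return ON_CALL[level]
    except KeyError:
        # Unknown level: escalate to the highest level rather than guess low.
        return ON_CALL[max(ON_CALL)]


if __name__ == "__main__":
    contact = who_to_contact(2)
    print(f"Notify {contact.name} via {contact.channel}")
```

Falling back to the highest level for an unknown input is a design choice made for this example: over-escalating costs a phone call, while under-escalating can cost hours.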
3. Monitoring and alerting
In the previous section, we mentioned that speed makes all the difference when it comes to resolving an incident successfully. The sooner you fix the issue, the better. The problem is that noticing these things won’t always be easy.
Even if you could afford it, you can’t have someone check all the features of your app/platform all the time. Something is bound to slip through. However, if you configure the right diagnostic app, pick the right KPIs for your application, and make sure to get a notification whenever one of them is off, you’ll have a much easier time keeping track of this.
Now, it’s important that you remember one thing: the acceptable range for each KPI needs to be relevant and realistic. You can’t get a notification every time something fails to respond immediately or every time traffic to your services drops slightly. Anomalies happen, and if you set up the notification system this way, it will flood you with notifications. Soon, it will be like the boy who cried wolf: you’ll see the notification and ignore it, believing it’s just another false alarm.
The most important thing with this is that monitoring must occur in real time. The system is always active; you’re just getting notified every time something’s off. You can even schedule a daily report that you can review, perhaps spotting some smaller anomalies in the process.
In other words, you want to make the system sensitive, but not too sensitive. Otherwise, it would be like setting off a house alarm every time a bird chirps two blocks down the road.
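One common way to keep alerts sensitive without making them noisy is to notify only when a KPI stays outside its acceptable range for several consecutive samples. The sketch below illustrates that idea; the class name, the 0 to 500 ms response-time range, and the three-sample rule are all assumptions chosen for the example, not recommended values.

```python
from collections import deque


class KpiAlert:
    """Fire only after `breaches_required` consecutive out-of-range samples."""

    def __init__(self, low: float, high: float, breaches_required: int = 3):
        self.low = low
        self.high = high
        self.breaches_required = breaches_required
        # Remembers only the most recent samples; older ones drop off automatically.
        self._recent = deque(maxlen=breaches_required)

    def observe(self, value: float) -> bool:
        """Record a sample and return True if a notification should fire."""
        out_of_range = not (self.low <= value <= self.high)
        self._recent.append(out_of_range)
        return len(self._recent) == self.breaches_required and all(self._recent)


if __name__ == "__main__":
    # Hypothetical KPI: response time in milliseconds, acceptable range 0 to 500.
    alert = KpiAlert(low=0, high=500, breaches_required=3)
    for sample in [120, 900, 130, 700, 800, 950]:
        print(sample, alert.observe(sample))
```

With these samples, the isolated spike at 900 is ignored, and the alert fires only on the last value, once three readings in a row are out of range. The alert keeps firing for as long as the breach continues, so a real setup would likely add some de-duplication on top.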
4. Patch management
Now, if an incident has happened, there are only two possible scenarios:
- A one-time, accidental incident
- A systematic issue that will happen again
This is why the situation is not resolved just because you can go back to working as you did before the event. At this point, the cat might already be out of the bag, and you’re aware of the fact that this can happen again or that it can happen at all. In this scenario, the situation is resolved when you’ve made sure that something like this will never happen again.
We’re talking about uploading a patch that will fix the issue once and for all.
Now, as a web developer, you’ll sometimes be responsible for patching your own app manually. However, suppose you heavily rely on third-party templates, plugins, and widgets. In that case, chances are that you can just set up a system and give permissions for updates to be applied automatically.
Applying these patches is a key step in keeping your website safe from cyber attacks.
Granted, you may still be required to give permission, since some of these patches require a terms-of-service update; however, the update will be scheduled as soon as it’s available, and you’ll receive a notification.
It’s in your best interest to expedite the deployment of critical security patches, which means that you need to stay up to date with the latest developments in the field.
Most importantly, you need to introduce a strong patching policy and procedure. This involves a roadmap as well as a timeline for patch deployment. Some of these patches will cause a bit of downtime, so make sure to schedule them for a period when you can actually afford it.
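A patching timeline like this can be expressed as a small scheduling rule. The sketch below assumes a made-up policy: critical security patches must go out within 24 hours of release, and everything else waits for a weekly maintenance window on Sunday at 02:00. Both numbers are placeholders for whatever your own policy says.

```python
from datetime import datetime, timedelta

# Hypothetical policy values, chosen only for this example.
MAINTENANCE_WEEKDAY = 6                    # Sunday (Monday is 0)
MAINTENANCE_HOUR = 2                       # 02:00, when downtime is affordable
CRITICAL_DEADLINE = timedelta(hours=24)    # expedite critical security patches


def schedule_patch(released: datetime, critical: bool) -> datetime:
    """Return the latest acceptable deployment time for a patch."""
    if critical:
        return released + CRITICAL_DEADLINE
    # Find the maintenance window on or after the release date.
    days_ahead = (MAINTENANCE_WEEKDAY - released.weekday()) % 7
    window = (released + timedelta(days=days_ahead)).replace(
        hour=MAINTENANCE_HOUR, minute=0, second=0, microsecond=0
    )
    if window <= released:
        # This week's window has already passed; use next week's.
        window += timedelta(days=7)
    return window


if __name__ == "__main__":
    released = datetime(2024, 1, 3, 10, 0)  # a Wednesday
    print(schedule_patch(released, critical=False))
    print(schedule_patch(released, critical=True))
```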
5. Incident communication
In one of the previous segments, we discussed the fact that you need a plan that spells out who is responsible for which kind of incident. Who do employees contact when they notice a problem of a specific kind?
However, you also need to address how they reach that person.
You see, sometimes the communication will take place via an internal platform, which may itself be down (during level-two or level-three incidents). So, you need to have a contingency plan in the form of a contact list and an alternative messaging platform.
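The contingency idea can be sketched as an ordered list of channels that are tried one after another until one works. The function and channel names below are hypothetical; each "sender" is simply a callable that raises an exception when its channel is unavailable.

```python
from typing import Callable, Iterable, Tuple

# A sender delivers a message over one channel and raises if the channel is down.
Sender = Callable[[str], None]


def notify_with_fallback(message: str, channels: Iterable[Tuple[str, Sender]]) -> str:
    """Try each channel in order; return the name of the first one that succeeds."""
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception:
            # Channel unavailable: move on to the next entry on the list.
            continue
    raise RuntimeError("all communication channels failed")


if __name__ == "__main__":
    def internal_chat(message: str) -> None:
        # Simulates the internal platform being down during the incident.
        raise ConnectionError("internal platform is down")

    def sms(message: str) -> None:
        print(f"SMS sent: {message}")

    used = notify_with_fallback(
        "Primary platform is down",
        [("internal chat", internal_chat), ("sms", sms)],
    )
    print(f"Delivered via {used}")
```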
You also want to make this more standardized, so you might want to prepare an incident report template. By just having this template and an address to send it to, you’ll streamline this process and reduce the chance that something will go unreported.
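A report template does not need to be elaborate. As a minimal sketch, the example below fills a fixed plain-text template from a handful of fields; the field names and the reporting address are assumptions for illustration, so adapt them to whatever your team actually needs to know.

```python
from dataclasses import dataclass
from datetime import datetime
from string import Template

# Hypothetical template; every report has the same fields in the same order.
REPORT_TEMPLATE = Template(
    "INCIDENT REPORT\n"
    "Reported by: $reporter\n"
    "Noticed at:  $noticed_at\n"
    "Level:       $level\n"
    "Origin:      $origin\n"
    "Description: $description\n"
)

REPORT_ADDRESS = "incidents@example.com"  # placeholder address


@dataclass
class IncidentReport:
    reporter: str
    noticed_at: datetime
    level: int
    origin: str
    description: str

    def render(self) -> str:
        """Fill the template; a missing field raises instead of going unreported."""
        return REPORT_TEMPLATE.substitute(
            reporter=self.reporter,
            noticed_at=self.noticed_at.isoformat(timespec="minutes"),
            level=self.level,
            origin=self.origin,
            description=self.description,
        )


if __name__ == "__main__":
    report = IncidentReport(
        reporter="J. Doe",
        noticed_at=datetime(2024, 1, 3, 10, 0),
        level=2,
        origin="checkout page",
        description="Payment form rejects valid cards",
    )
    print(f"To: {REPORT_ADDRESS}\n")
    print(report.render())
```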
A communication protocol is incredibly important since it also reduces the chance of a slip-up. If word of an incident slips out, you could land in some serious trouble. Just think about it; sometimes, you can fix the issue before word gets out. Once everyone starts talking about it, it’s too late, and your reputation may already have taken a hit.
You can’t avoid incidents; you can just manage them better
The procedure is pretty simple, even though the actual execution takes a lot of finesse. First, you need to understand what types of incidents are out there. Then, you need a response plan. Lastly, you need to introduce a system that will notify you as soon as something happens. The rest you’ll have to figure out as the situation unfolds.
By Srdjan Gombar
Veteran content writer, published author, and amateur boxer. Srdjan holds a Bachelor of Arts in English Language & Literature and is passionate about technology, pop culture, and self-improvement. He spends his free time reading, watching movies, and playing Super Mario Bros. with his son.