Opsgenie’s alerting and on-call features are now available in Jira Service Management and Compass. Migrate existing Opsgenie data and configurations before April 5th, 2027 using our automated migration tool.

What is IT incident alerting?

Incident alerting is when monitoring tools generate alerts to notify your team of changes, high-risk actions, or failures in the IT environment.

For example, a system built to allow doctors to prescribe medication may generate an alert if the dose a doctor requests is unusually high, does not match the body weight listed in the patient file, or poses a drug interaction risk with other common medications.

Similarly, a system built to monitor a tech product may generate an alert if a system goes offline, web requests are taking longer than usual to process, or database latency rises beyond a set threshold.

The goal of IT alerting is to quickly identify and resolve issues that impact product uptime, speed, and functionality, around the clock and without manual monitoring.

Why is IT alerting important?

As the importance of always-on systems continues to rise, so too does the cost of downtime, with experts estimating an average cost between $5,600 and $9,000 per minute. Since every minute of system failure is so pricey, identifying issues before they get out of hand has a big impact on the business bottom line (not to mention IT teams’ schedules and stress levels).

IT alerts are the first line of defense against system outages or changes that can turn into major incidents. By automatically monitoring systems and generating alerts for outages and risky changes, IT teams can minimize downtime and the high cost that comes with it.

Alerting best practices

IT alerts are undeniably an important part of incident management, but the truth is that they are not a simple fix you can set and forget. Setting alert thresholds too low can lead to overflowing inboxes, unhappy on-call teams, and alert fatigue. Setting thresholds too high can mean missing critical issues and costing the company millions.

Which is why the most effective IT alerting systems are set up with these best practices in mind.

Automate your monitoring

The best way to quickly and effectively identify issues is to automate monitoring.

Is a database responding slower than usual? Are users experiencing slower-than-average load times on your app? Is a vital system down? Has one of your technicians made a request that seems like a red flag? Your system should automatically be watching out for problems like these and letting you know when they arise.

Set smart alerting thresholds

Does every alert need immediate attention? For most companies, the answer is no, which is why you need to set sensible alert thresholds.

Knowing whether something is worth waking a developer in the middle of the night—or if it can wait until morning—can be the difference between happy developers with fast response times and alert-fatigued teams who spend their weekends looking for a new job.
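As a minimal sketch of the idea, the snippet below sorts a metric reading into "page now", "ticket for the morning", or "ignore". The `Threshold` class, the `classify` function, and the latency numbers are all hypothetical, chosen only for illustration; real thresholds should come from your own baselines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float  # at or above this, open a ticket for business hours
    page: float  # at or above this, wake the on-call engineer


def classify(value: float, threshold: Threshold) -> str:
    """Return 'page', 'ticket', or 'ok' for a metric where higher is worse."""
    if value >= threshold.page:
        return "page"
    if value >= threshold.warn:
        return "ticket"
    return "ok"


# Hypothetical database latency thresholds, in milliseconds.
db_latency = Threshold(warn=250.0, page=1000.0)

for sample in (120.0, 400.0, 1500.0):
    print(sample, classify(sample, db_latency))
```

Keeping two levels rather than one makes the "can it wait until morning?" decision explicit instead of leaving it to whoever gets woken up.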

De-duplicate your alerts

A study on alert fatigue found that, for clinicians in a hospital setting, alert attention dropped by 30% every time a duplicate alert came in. And it’s likely that the study results would be the same for developers. The more we see the same alert, the less we pay attention to it. Which is why the best practice here is to de-duplicate your alerts and minimize reminders.
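One common way to de-duplicate is to give each alert a key and fold repeats into a counter instead of sending them again. The sketch below assumes the key is simply the pair of source and message; the `Alert` class and `deduplicate` function are hypothetical names, and a real system would likely use a richer key and a time window.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str
    message: str
    count: int = 1  # how many times this same alert has fired


def deduplicate(events):
    """Collapse repeated (source, message) events into one Alert each."""
    open_alerts = {}
    for source, message in events:
        key = (source, message)
        if key in open_alerts:
            open_alerts[key].count += 1  # repeat: bump the counter only
        else:
            open_alerts[key] = Alert(source, message)
    return list(open_alerts.values())


events = [
    ("db-1", "latency high"),
    ("db-1", "latency high"),
    ("web-2", "5xx spike"),
    ("db-1", "latency high"),
]
for alert in deduplicate(events):
    print(alert)
```

The on-call engineer sees two alerts here rather than four notifications, and the count still shows that the database problem keeps recurring.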

Set priority and severity levels

Obviously, some alerts are more important than others. A website outage is probably going to take precedence over a brief slow-down on an infrequently-used feature. Malicious hacking is probably a higher priority than an image that isn’t rendering correctly in your app.

Not only should your system recognize alert priority and severity, but it should also communicate that priority clearly to the people responsible for resolving incidents. The best practice here is to use visual, audible, and sensory cues to quickly and clearly indicate what teams should focus on next.
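A rough sketch of how priority levels might be modeled so that the most urgent alert always surfaces first. The `Priority` levels and the `triage` helper are assumptions made for this example, not a standard scheme.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical levels; lower number means more urgent.
    P1 = 1  # e.g. full website outage or active attack
    P2 = 2  # e.g. degraded performance on a core feature
    P3 = 3  # e.g. cosmetic issue such as a broken image


def triage(alerts):
    """Return (priority, summary) pairs ordered most urgent first.

    sorted() is stable, so alerts of equal priority keep arrival order.
    """
    return sorted(alerts, key=lambda alert: alert[0])


incoming = [
    (Priority.P3, "Image not rendering in app"),
    (Priority.P1, "Website is down"),
    (Priority.P2, "Slow-down on reporting page"),
]
for priority, summary in triage(incoming):
    print(priority.name, summary)
```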

Make alerts actionable

Knowing what’s wrong is good. Knowing what to do next is better. Which is why if your alerts aren’t actionable, they should be.

This is one place where DevOps teams can learn from the aviation industry. When an alert shows up on a pilot’s dashboard during a flight, it comes with an actionable checklist. Building this kind of detail into your alert system cuts down on diagnostic time and helps developers move quickly through your process.

This is especially helpful when a developer is up in the middle of the night, bleary-eyed and not at the top of their game.
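To illustrate the checklist idea, the sketch below attaches numbered next steps to an alert before it is sent. The `RUNBOOKS` table, the alert kind `db_latency_high`, and the steps themselves are invented placeholders; your own runbooks would supply the real content.

```python
# Hypothetical runbook steps, keyed by alert kind.
RUNBOOKS = {
    "db_latency_high": [
        "Check the database dashboard for long-running queries.",
        "Confirm whether a deploy went out in the last hour.",
        "If latency stays high, fail over to the replica.",
    ],
}


def render_alert(kind: str, summary: str) -> str:
    """Build an alert body that ends with a numbered checklist."""
    steps = RUNBOOKS.get(kind, ["No runbook found; escalate to the service owner."])
    lines = [f"ALERT: {summary}", "Next steps:"]
    lines += [f"  {number}. {step}" for number, step in enumerate(steps, start=1)]
    return "\n".join(lines)


print(render_alert("db_latency_high", "Database latency above threshold"))
```

The fallback step matters: an alert with no runbook should still tell the responder what to do, even if that is only who to escalate to.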

Choosing the right alerting technology

Developing an IT alerting system that follows these best practices means being strategic about alerts up front. It also means choosing the right technology to do so. When choosing a vendor, we recommend looking for:

Multiple alerting channels

Email is often the channel of choice when it comes to alerts. But the truth is that email doesn’t always cut it. For urgent alerts, you may want or need SMS, mobile push notifications, or even voice calls. Look for a system that allows you to alert in a variety of ways.
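The sketch below shows one way channel selection could depend on urgency. The priority names, the `CHANNELS_BY_PRIORITY` mapping, and the `notify` function are hypothetical, and the senders are stand-ins that only record messages; no real SMS or voice API is being called.

```python
# Hypothetical mapping from alert priority to delivery channels.
CHANNELS_BY_PRIORITY = {
    "critical": ["voice", "sms", "push"],
    "high": ["sms", "push"],
    "low": ["email"],
}


def notify(priority, message, senders):
    """Send the message on every channel configured for the priority.

    `senders` maps a channel name to a callable that takes the message.
    Unknown priorities fall back to email. Returns the channels used.
    """
    channels = CHANNELS_BY_PRIORITY.get(priority, ["email"])
    used = []
    for channel in channels:
        send = senders.get(channel)
        if send is not None:
            send(message)
            used.append(channel)
    return used


# Stand-in senders that just record what they were asked to deliver.
outbox = {name: [] for name in ("voice", "sms", "push", "email")}
senders = {name: outbox[name].append for name in outbox}

print(notify("critical", "Checkout API is down", senders))
print(notify("low", "Disk usage at 70%", senders))
```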

Alert enrichment

Actionable alerts are detailed alerts. Which means a short text message isn’t always enough. Beware of strict character limits and look for technology that lets you attach charts, logs, runbooks, and checklists to provide additional context to an alert and let the developer know what they should do next.

Custom alert actions

Most alert technology will let you add a note to your alert or close it out. But sometimes there are steps in between. Like escalating the alert for further investigation, creating a service ticket, or restarting a server. Look for tech solutions that let you do more than just open and close.

Automated actions

For some alerts, what to do next is complicated and requires an experienced developer’s insight. For others, the way forward is clear.

For alerts with clear next steps, such as diagnostic tests or remedial actions, you’ll want a system that triggers those responses automatically when an alert meets your predefined criteria.

For example, if a database slows, perhaps you set your alert system to automatically switch to a backup database. If the first step in fixing Issue A is always to restart a server, maybe you set your alert system to restart the server and monitor the result before sending out a middle-of-the-night alert.
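The restart-then-monitor flow could be sketched roughly as follows. `handle_alert` is a hypothetical helper, and the `check`, `remediate`, and `page` callables are placeholders for whatever health check, restart command, and paging mechanism you actually use.

```python
def handle_alert(check, remediate, page, retries=1):
    """Try automated remediation first; page a human only if it fails.

    check()     -> True when the service is healthy
    remediate() -> performs the known first step (e.g. restart a server)
    page()      -> notifies the on-call engineer
    """
    if check():
        return "healthy"
    for _ in range(retries):
        remediate()
        if check():  # monitor the result before waking anyone
            return "auto-resolved"
    page()
    return "paged"


# Simulated server that comes back after one restart.
server = {"up": False, "restarts": 0, "paged": False}


def check():
    return server["up"]


def restart():
    server["restarts"] += 1
    server["up"] = True


def page():
    server["paged"] = True


print(handle_alert(check, restart, page))
```

Capping `retries` keeps automation from looping forever on a fault it cannot fix; once the cap is reached, the alert goes to a person.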

Alert customization and classification

As alerts come in, your team should be able to organize them, tag them with additional info, and filter them.

Alert lifecycle tracking

In your incident postmortem, you’ll want to know when the alert came in, who received it, when they saw it, and what action was taken. Make sure any technology you choose automatically tracks these details. It’ll make it simpler to understand what is and isn’t working, improve your KPIs, and document past incidents so that on-call teams can learn from them and refer back to those learnings for future incidents.
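As an illustration of what lifecycle tracking records, the sketch below keeps a timestamped log of state changes and derives a time-to-acknowledge figure from it. The `AlertTimeline` class and the state names are assumptions for this example only.

```python
from datetime import datetime, timedelta


class AlertTimeline:
    """Append-only log of what happened to an alert, and when."""

    def __init__(self):
        self.events = []  # list of (timestamp, state, actor)

    def record(self, state, actor, at):
        self.events.append((at, state, actor))

    def time_between(self, start_state, end_state):
        """Elapsed time from the first start_state to the first end_state."""
        first_seen = {}
        for at, state, _actor in self.events:
            first_seen.setdefault(state, at)
        return first_seen[end_state] - first_seen[start_state]


t0 = datetime(2024, 1, 1, 3, 0)  # arbitrary example start time
timeline = AlertTimeline()
timeline.record("created", "monitor", t0)
timeline.record("notified", "alice", t0 + timedelta(minutes=1))
timeline.record("acknowledged", "alice", t0 + timedelta(minutes=5))
timeline.record("closed", "alice", t0 + timedelta(minutes=40))

print("time to acknowledge:", timeline.time_between("created", "acknowledged"))
```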

Alert and notification policies

If the best practice is to set intelligent thresholds for your alerts and make sure minor issues aren’t waking your developers in the middle of their REM sleep, you need technology that lets you suppress, delay, and expedite alerts based on their content and timing.
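A notification policy of this kind might look roughly like the sketch below, which decides on content (priority and tags) and timing (quiet hours). The `decide` function, the `maintenance` tag, and the 22:00 to 07:00 quiet window are all hypothetical choices.

```python
from datetime import time


def decide(priority: str, tags: set, now: time) -> str:
    """Return 'suppress', 'expedite', 'delay', or 'notify' for an alert."""
    # Assumed quiet hours: 22:00 up to (not including) 07:00.
    quiet_hours = now >= time(22, 0) or now < time(7, 0)
    if "maintenance" in tags:
        return "suppress"  # expected noise during planned work
    if priority == "critical":
        return "expedite"  # always goes out immediately
    if quiet_hours and priority == "low":
        return "delay"  # hold minor issues until morning
    return "notify"


print(decide("low", set(), time(3, 0)))
print(decide("critical", set(), time(3, 0)))
print(decide("low", {"maintenance"}, time(14, 0)))
```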

Real-time monitoring for your monitoring

How do you know, at any given moment, that your alert systems are up and running?

The answer, with the right technology, should be that the tech has its own monitoring system. With Opsgenie, we do this with a tool called Heartbeats, which continuously checks that monitoring tools are active and connected and custom tasks are completed on schedule. If the signal goes down, the system alerts you instantly.
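The general heartbeat idea can be sketched generically: each monitoring tool pings on a schedule, and a watcher flags any tool whose last ping is older than the allowed interval. This is not the Opsgenie Heartbeats API; `expired_heartbeats`, the tool names, and the five-minute interval are assumptions for illustration.

```python
from datetime import datetime, timedelta


def expired_heartbeats(last_ping, interval, now):
    """Return names of tools whose last ping is older than the interval."""
    return sorted(name for name, seen in last_ping.items() if now - seen > interval)


now = datetime(2024, 1, 1, 12, 0)  # fixed clock for a repeatable example
last_ping = {
    "metrics-collector": now - timedelta(minutes=2),
    "log-shipper": now - timedelta(minutes=12),
    "nightly-backup-job": now - timedelta(minutes=4),
}

# Hypothetical rule: every tool must ping at least once every 5 minutes.
for name in expired_heartbeats(last_ping, timedelta(minutes=5), now):
    print(f"No heartbeat from {name}; monitoring may be down")
```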
