Troubleshooting Duplicate Alerts in Prometheus' Alertmanager


At my workplace, we have Prometheus' Alertmanager set up with receivers for JIRA automation, AWS Incident Manager, Slack, and other webhooks, with alerts triggered by PromQL expressions.

I was assigned a task to expose a method that triggers a Slack message to a particular channel. A fairly straightforward task: Alertmanager already has an API for this. I just had to hit the API and write a matcher to ensure the triggered alert goes to the right Slack channel.

The alert gets triggered with this request:

curl --request POST \
  --url https://alertmanager-instance.company.com/api/v1/alerts \
  --header 'Content-Type: application/json' \
  --data '[
    {
        "annotations": {
            "summary": "Prod down, company in huge loss."
        },
        "labels": {
            "alertname": "A very serious alert - SEV1",
            "team": "intern-oncall"
        }
    }
]'

The matcher is on the label team and routes to slack-receiver:

- match:
    team: intern-oncall
  receiver: slack-receiver
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10m
  repeat_interval: 4h
  continue: true
- name: 'slack-receiver'
  slack_configs:
  - channel: '#oncall-alerts'
    text: " {{ .CommonAnnotations.summary }}"
    api_url: "https://hooks.slack.com/services/...."
    send_resolved: true
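
For context, here is roughly where those two snippets sit in the full alertmanager.yml. The top-level layout and the default-receiver catch-all are assumptions for illustration; only the nested route and the slack-receiver come from our actual config.

route:
  receiver: default-receiver          # hypothetical catch-all receiver
  routes:
  - match:
      team: intern-oncall
    receiver: slack-receiver
    group_by: ['alertname']
    group_wait: 10s
    group_interval: 10m
    repeat_interval: 4h
    continue: true

receivers:
- name: 'default-receiver'            # hypothetical
  webhook_configs:
  - url: "https://example.com/hook"
- name: 'slack-receiver'
  slack_configs:
  - channel: '#oncall-alerts'
    text: " {{ .CommonAnnotations.summary }}"
    api_url: "https://hooks.slack.com/services/...."
    send_resolved: true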

Code pushed. Deployed to prod. (On track for a 5-star rating in my year-end appraisals.)

A few days later, when the team that had asked for this started triggering alerts, they reported that they were getting Slack messages twice for a single alert.

On checking, I saw that the alerts were being repeated exactly 10 minutes apart.

[Slack screenshot: the same alert delivered twice]

Ah, yes, I had set group_interval as 10 minutes.

From the Alertmanager docs:

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.) If omitted, child routes
# inherit the group_interval of the parent route.
[ group_interval: <duration> | default = 5m ]

So group_interval shouldn't be the issue: it only applies when a new alert joins a group for which the initial notification has already been sent, and all it does is delay that follow-up notification by 10 minutes. However, according to the logs, only one alert was being triggered.
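
To make that concrete: group_interval would only come into play if a second alert landed in the same group while it was active, e.g. something like the request below. The extra instance label is made up; everything else mirrors the request above.

# Hypothetical second alert: same alertname, so it joins the existing group
# and its notification is batched until the next group_interval tick.
curl --request POST \
  --url https://alertmanager-instance.company.com/api/v1/alerts \
  --header 'Content-Type: application/json' \
  --data '[
    {
        "annotations": {
            "summary": "Another failure on the same alert."
        },
        "labels": {
            "alertname": "A very serious alert - SEV1",
            "team": "intern-oncall",
            "instance": "host-2"
        }
    }
]'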

Also, even if multiple alerts were being triggered, I made sure they would be treated as separate groups by appending a UUID to the alertname label, which is what the route groups by. We were still seeing duplicate alerts exactly after the group_interval duration.
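
A sketch of that workaround, assuming the request is fired from a shell (the real code generated the UUID elsewhere):

# Appending a UUID to alertname means every alert gets its own group
# under group_by: ['alertname'], so grouping can't merge two alerts.
ALERT_ID=$(uuidgen)
curl --request POST \
  --url https://alertmanager-instance.company.com/api/v1/alerts \
  --header 'Content-Type: application/json' \
  --data '[
    {
        "annotations": {
            "summary": "Prod down, company in huge loss."
        },
        "labels": {
            "alertname": "A very serious alert - SEV1 - '"$ALERT_ID"'",
            "team": "intern-oncall"
        }
    }
]'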

The create-alert API has another request body param, endsAt. It indicates the alert's end time, after which the alert is considered resolved.

I set endsAt to 1 minute in the future, so that after the initial group_wait of 10s the alert fires, and before the group_interval of 10m it ends. So the duplicate alert shouldn't be sent, right? But no, the alerts were still repeated after the group_interval of 10m. :angry:
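
Roughly what that request looked like; the timestamps are illustrative, and Alertmanager expects them in RFC 3339.

# startsAt defaults to "now" if omitted; endsAt one minute out means the
# alert should resolve itself well before the 10m group_interval.
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u -d '+1 minute' +%Y-%m-%dT%H:%M:%SZ)   # GNU date; on macOS: date -u -v+1M ...
curl --request POST \
  --url https://alertmanager-instance.company.com/api/v1/alerts \
  --header 'Content-Type: application/json' \
  --data '[
    {
        "annotations": {
            "summary": "Prod down, company in huge loss."
        },
        "labels": {
            "alertname": "A very serious alert - SEV1",
            "team": "intern-oncall"
        },
        "startsAt": "'"$NOW"'",
        "endsAt": "'"$END"'"
    }
]'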

By now I was starting to question my sanity. A StackOverflow comment suggested setting a high value for group_interval. I set it to 12h, but the alerts still got repeated; now just after 12h instead.

I was beginning to think this would never work, and I had a fallback ready: scrape Alertmanager myself and call a Slack webhook to send the alert.

I decided to go through the configs one last time.

- match:
    team: intern-oncall
  receiver: slack-receiver
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10m
  repeat_interval: 4h
  continue: true
- name: 'slack-receiver'
  slack_configs:
  - channel: '#oncall-alerts'
    text: " {{ .CommonAnnotations.summary }}"
    api_url: "https://hooks.slack.com/services/...."
    send_resolved: true

send_resolved ??

# Whether to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

I had this set to true. Who was triggering resolved alerts? Could this be the duplicate alert?

I set it to false, and I no longer got duplicate alerts!
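
The fix was just flipping that one flag in the receiver:

- name: 'slack-receiver'
  slack_configs:
  - channel: '#oncall-alerts'
    text: " {{ .CommonAnnotations.summary }}"
    api_url: "https://hooks.slack.com/services/...."
    send_resolved: false   # don't notify on resolve, so no second Slack message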

All this time, I had not actually been receiving duplicate alerts. The second message was the resolved notification: Alertmanager generates a notification for the same alert once it is resolved. That notification was created right after the alert resolved, but since the resolved alert got added to the same alert group and our group_interval was 10m, it was only sent 10 minutes after the first notification.

I wasted quite some time and mental space on this, but it turned Alertmanager, which had been a black box for me, into a white box.

[Excalidraw diagram: alert lifecycle]

This is the typical flow of an alert, based on the values of the configuration parameters below:

startsAt, endsAt, group_by, group_wait,
group_interval, repeat_interval, send_resolved

One or more of these states can be skipped. For example, setting endsAt to a time before group_wait has elapsed can prevent the alert from ever being fired; it goes from unprocessed -> pending -> resolved.
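
When in doubt about which state an alert is actually in, you can ask Alertmanager directly. A sketch, assuming the v2 API is reachable and jq is installed:

# Lists matching alerts with their current state (unprocessed / active / suppressed).
curl -s -G 'https://alertmanager-instance.company.com/api/v2/alerts' \
  --data-urlencode 'filter=team="intern-oncall"' \
  | jq '.[] | {alertname: .labels.alertname, state: .status.state, startsAt, endsAt}'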

I'll have to be more careful next time when copy-pasting boilerplate configurations.

References

  1. https://prometheus.io/docs/alerting/latest/configuration/

  2. https://jaanhio.me/blog/understanding-alertmanager/

  3. https://github.com/prometheus/alertmanager/issues/1005
