A misadventure with Terraform Sets & PagerDuty Schedules

"T, why didn't I get this page?" 🤨

"Wait, why does it show that <other_person> is on call? They just did it the other week." 🧐

Are two phrases that you don't want to hear after making changes to the Terraform for your PagerDuty schedules.

Intro

In the last couple of weeks, I've been leading the effort to onboard three new engineers to our on-call rotation. One of the tasks is to get those engineers added to PagerDuty (PD), the app we use for managing on-call shifts and alerting. While this can easily be done in the PD UI, we implement these changes via Terraform so that they're documented, codified, and tracked via version control, which also adds another layer of auditability.

Some key concepts for working with PagerDuty:

  • A schedule determines the WHO, and WHEN. (Who will be in the rotation, how long the rotation will be, and when the rotation starts).

  • An escalation policy determines the ordering/logic for which schedules get paged.

  • A service is what represents your service (or system) and will be linked to an escalation policy.

So from the top:

  1. When a service has an alert, PD will look at the escalation policy.

  2. Based on the escalation policy and the current situation (e.g. first alert, first loop), PD will notify the appropriate schedule.

You can see a full gist of the old code here.

An important note for this example is that my team is actually considered a subteam (A) that shares its pager with subteam (B).
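
To make the relationship between schedule, escalation policy, and service concrete, here is a minimal sketch of how they could be wired together with the PagerDuty Terraform provider. The resource names `example_policy` and `example_service`, and values such as the escalation delay, are placeholders for illustration and not our actual configuration. It assumes a `pagerduty_schedule.myteam_schedule` resource like the one shown later in this post.

```hcl
# Hypothetical example: names and values are illustrative only.

# The escalation policy decides which schedule gets paged, and in what order.
resource "pagerduty_escalation_policy" "example_policy" {
  name      = "My Team Escalation Policy"
  num_loops = 2 # times the policy repeats after reaching the end of its rules

  rule {
    escalation_delay_in_minutes = 10

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.myteam_schedule.id
    }
  }
}

# The service represents the system being monitored and points at the policy.
resource "pagerduty_service" "example_service" {
  name              = "My Service"
  escalation_policy = pagerduty_escalation_policy.example_policy.id
}
```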

Before

Prior to this work, I had originally set my schedule up as follows:

I also had each person's membership in a PagerDuty team like this:

Given that:

  1. I was specifying an association to a user twice AND

  2. Creating a new resource for each team membership; I wondered if I could refactor this.

Enter, the Good Idea Fairy 🧚🏼

Since my last brush with Terraform, I'd like to think I'd gotten better with it - especially with the use of for_each statements. So when looking at a solution to this "problem" - I thought:

Why not just create a local.members list with all the users, and then use that (1) as the members for the schedule and (2) to create the team_memberships in a single statement via a for_each?

In FACT! Since we have two subteams, I could create two lists and simply combine them!

After

This is what I ended up with after refactoring and making what I thought were good changes. Gist.

I thought I was pretty slick by doing the following:

  • Setting up the list of teammates in a local variable.
locals {
    my_team_subteam_a_members = toset([
        pagerduty_user.thilina_ratnayake.id,
        pagerduty_user.teammate_b.id,
        pagerduty_user.teammate_c.id,
    ])
    my_team_subteam_b_members = toset([
        pagerduty_user.teammate_a.id,
        pagerduty_user.teammate_d.id
    ])
}
  • Using the sets from above with setunion() to combine subteams A and B.
resource "pagerduty_schedule" "myteam_schedule" {
  name        = "My Team"
  time_zone   = "America/Los_Angeles"
  description = "PD Schedule for My Team, Slack #my-team, Email: my-team@company.com"

  layer {
    name                         = "weekday"
    rotation_turn_length_seconds = 1209600 # 14 days
    rotation_virtual_start       = "2023-01-01T09:00:00-08:00"
    start                        = "2023-01-01T09:00:00-08:00"
    users                        = setunion(local.my_team_subteam_a_members, local.my_team_subteam_b_members)
  }
}
  • Iterate through the memberships for each subteam.
resource "pagerduty_team_membership" "my_team_subteam_a_members" {
  for_each = local.my_team_subteam_a_members
  user_id = each.value
  team_id = pagerduty_team.my_team_subteam_a.id
}

resource "pagerduty_team_membership" "my_team_subteam_b_members" {
  for_each = local.my_team_subteam_b_members
  user_id = each.value
  team_id = pagerduty_team.my_team_subteam_b.id
}

Except, I wasn't. Because this didn't go as planned - and the day after I made the changes we noticed that the PagerDuty schedules were completely off.

The Reason

In a schedule, ordering matters.

Before, we had specified the ordering and had that ordering based on a start date. That meant that after every interval (rotation), the next person would be in the hot seat to carry the pager.

However, when we did:

users = setunion(local.my_team_subteam_a_members, local.my_team_subteam_b_members)

This performed a union of the sets, which discards the original order entirely. In fact, the documentation spells this out, and I missed it 🤦🏽‍♂️:

> setunion(["a", "b"], ["b", "c"], ["d"])
[
  "d",
  "b",
  "c",
  "a",
]

The given arguments are converted to sets, so the result is also a set and the ordering of the given elements is not preserved.

By doing a setunion on local.my_team_subteam_a_members and local.my_team_subteam_b_members, the ordering was completely disregarded, which led to PagerDuty putting someone on call who wasn't scheduled for that rotation.
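
The difference between a list and a set can be seen in a small sketch. The local names and user strings below are made up for illustration. `concat()` is a list function that keeps the order it is given, whereas `setunion()` returns a set with no guaranteed order.

```hcl
# Hypothetical names and values, for illustration only.
locals {
  subteam_a = ["carol", "alice"]
  subteam_b = ["bob"]

  # concat() returns a list: elements stay in the order given
  # (and duplicates, if any, are kept).
  # Result: ["carol", "alice", "bob"]
  rotation_as_list = concat(local.subteam_a, local.subteam_b)

  # setunion() returns a set: duplicates are removed and no ordering
  # is guaranteed, so it should not be used where position matters.
  rotation_as_set = setunion(local.subteam_a, local.subteam_b)
}
```

Note that `concat()` only helps if "all of A, then all of B" happens to be the rotation order you want.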

Conclusion

While it's great to be DRY and avoid repeating values, that shouldn't get in the way of functionality. With regard to Terraform:

  1. If ordering matters in a list, don't use setunion()

  2. Especially if you're setting up a PagerDuty schedule, just "hardcode" / manually specify the rotation order.
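
As a rough sketch of that second point, the schedule could list its users explicitly while the team memberships keep using `for_each` over the sets, since order doesn't matter for membership. The specific rotation order below is an assumption made for illustration, not our real rotation.

```hcl
resource "pagerduty_schedule" "myteam_schedule" {
  name      = "My Team"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "weekday"
    rotation_turn_length_seconds = 1209600 # 14 days
    rotation_virtual_start       = "2023-01-01T09:00:00-08:00"
    start                        = "2023-01-01T09:00:00-08:00"

    # An explicit list: position in this list is the rotation order.
    # (The order shown here is assumed, for illustration only.)
    users = [
      pagerduty_user.thilina_ratnayake.id,
      pagerduty_user.teammate_a.id,
      pagerduty_user.teammate_b.id,
      pagerduty_user.teammate_d.id,
      pagerduty_user.teammate_c.id,
    ]
  }
}
```

This brings back a little repetition, because each user appears once in the schedule and once in a subteam set, but the rotation order is now stated in exactly one place and can't be silently reshuffled.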


Written by

Thilina Ratnayake

I'm a developer from Vancouver, BC who's had an interesting journey in tech, starting from support and moving through cloud infrastructure and project management. Currently I work as an SRE at Lightstep, helping build and "operationalize" things that help guide others towards better o11y :)