A misadventure with Terraform Sets & PagerDuty Schedules
"T, why didn't I get this page?" 🤨
"Wait, why does it show that <other_person> is on call? They just did it the other week." 🧐
Are two phrases that you don't want to hear after making changes to the Terraform for your PagerDuty schedules.
Intro
In the last couple of weeks, I've been leading the effort to onboard 3 new engineers to our on-call rotation. As part of that work, one of the tasks is to get those engineers added to PagerDuty (PD), the app we use for managing on-call shifts and alerting. While this can easily be done in the PD UI, we implement these changes via Terraform so that they're documented, codified, and tracked via version control, which also adds another layer of auditability.
Some key concepts for working with PagerDuty:
- A schedule determines the WHO and WHEN (who will be in the rotation, how long the rotation will be, and when the rotation starts).
- An escalation policy determines the ordering/logic for which schedules get paged.
- A service is what represents your service (or system) and will be linked to an escalation policy.
So from the top:
- When a service has an alert, PD will look at the escalation policy.
- Based on the escalation policy and the current situation (i.e. first alert, first loop), PD will notify the appropriate schedule.
You can see a full gist of the old code here.
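To make the service, escalation policy, and schedule chain a little more concrete, here is a minimal sketch of how the three could be wired together with the PagerDuty Terraform provider. This is not the code from the gist: the resource names (myteam_policy, my_service), the delay, and the loop count are assumptions made up for illustration.

```hcl
# Hypothetical escalation policy: decides which schedule gets paged, and when.
resource "pagerduty_escalation_policy" "myteam_policy" {
  name      = "My Team Escalation Policy"
  num_loops = 2 # assumed value: repeat the rules twice before giving up

  rule {
    escalation_delay_in_minutes = 10 # assumed value

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.myteam_schedule.id
    }
  }
}

# Hypothetical service: alerts on it are routed through the policy above.
resource "pagerduty_service" "my_service" {
  name              = "My Service"
  escalation_policy = pagerduty_escalation_policy.myteam_policy.id
}
```

Reading it bottom-up mirrors the flow described above: the service points at the escalation policy, and the policy's rule targets the schedule.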
An important note for this example is that my team is actually considered a subteam (A) that shares its pager with subteam (B).
Before
Prior to this work, I had originally set my schedule up as follows:
I also had each person's membership in a PagerDuty team like this:
Given that I was:
- specifying an association to each user twice, AND
- creating a new resource for each team membership,
I wondered if I could refactor this.
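For readers without the gist open, the layout described above probably looked something like the following. This is a hypothetical reconstruction, not the original code: the user order and the membership resource names are assumptions, and only the general shape (an explicit users list plus one membership resource per person) comes from the description.

```hcl
resource "pagerduty_schedule" "myteam_schedule" {
  name      = "My Team"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "weekday"
    rotation_turn_length_seconds = 1209600 # 14 days
    rotation_virtual_start       = "2023-01-01T09:00:00-08:00"
    start                        = "2023-01-01T09:00:00-08:00"

    # An explicit list: the order written here is the rotation order.
    # (Illustrative order only.)
    users = [
      pagerduty_user.thilina_ratnayake.id,
      pagerduty_user.teammate_a.id,
      pagerduty_user.teammate_b.id,
    ]
  }
}

# One membership resource per person, repeated for every teammate.
resource "pagerduty_team_membership" "thilina_ratnayake" {
  user_id = pagerduty_user.thilina_ratnayake.id
  team_id = pagerduty_team.my_team_subteam_a.id
}

resource "pagerduty_team_membership" "teammate_a" {
  user_id = pagerduty_user.teammate_a.id
  team_id = pagerduty_team.my_team_subteam_b.id
}
```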
Enter, the Good Idea Fairy 🧚🏼
Since my last brush with Terraform, I'd like to think I'd gotten better with it - especially with the use of for_each statements. So when looking at a solution to this "problem" - I thought:
Why not just create a local.members list with all the users, and then use that (1) as the members for the schedule and (2) to have a single statement that creates the team_memberships via a for_each? In FACT! Since we have two subteams, I could create two lists and simply combine them!
After
This is what I ended up with after refactoring and making what I thought were good changes. Gist.
I thought I was pretty slick by doing the following:
- Setting up the list of teammates in a local variable.
locals {
  my_team_subteam_a_members = toset([
    pagerduty_user.thilina_ratnayake.id,
    pagerduty_user.teammate_b.id,
    pagerduty_user.teammate_c.id,
  ])
  my_team_subteam_b_members = toset([
    pagerduty_user.teammate_a.id,
    pagerduty_user.teammate_d.id
  ])
}
- Using the sets from above with setunion() to combine both subteams A and B.
resource "pagerduty_schedule" "myteam_schedule" {
  name        = "My Team"
  time_zone   = "America/Los_Angeles"
  description = "PD Schedule for My Team, Slack #my-team, Email: my-team@company.com"

  layer {
    name                         = "weekday"
    rotation_turn_length_seconds = 1209600 # 14 days
    rotation_virtual_start       = "2023-01-01T09:00:00-08:00"
    start                        = "2023-01-01T09:00:00-08:00"
    users                        = setunion(local.my_team_subteam_a_members, local.my_team_subteam_b_members)
  }
}
- Iterating through the memberships for each subteam.
resource "pagerduty_team_membership" "my_team_subteam_a_members" {
  for_each = local.my_team_subteam_a_members
  user_id  = each.value
  team_id  = pagerduty_team.my_team_subteam_a.id
}
resource "pagerduty_team_membership" "my_team_subteam_b_members" {
  for_each = local.my_team_subteam_b_members
  user_id  = each.value
  team_id  = pagerduty_team.my_team_subteam_b.id
}
Except I wasn't. This didn't go as planned: the day after I made the changes, we noticed that the PagerDuty schedules were completely off.
The Reason
In a schedule, ordering matters.
Before, we had specified the ordering explicitly and anchored it to a start date. That meant that after every interval (rotation), the next person in the list would be in the hot seat to carry the pager.
However, when we did:
users = setunion(local.my_team_subteam_a_members, local.my_team_subteam_b_members)
This ended up doing a union of the sets, which completely disregards the order. In fact, that's actually specified in the documentation, which I missed 🤦🏽‍♂️:
> setunion(["a", "b"], ["b", "c"], ["d"])
[
  "d",
  "b",
  "c",
  "a",
]
The given arguments are converted to sets, so the result is also a set and the ordering of the given elements is not preserved.
By doing a setunion on local.my_team_subteam_a_members and local.my_team_subteam_b_members, the ordering was completely disregarded, which led to PagerDuty putting someone on call who wasn't scheduled for that rotation.
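For contrast, here is a small sketch (not from the original code) using Terraform's list functions, which keep element order where the set functions don't. The local names are made up for illustration.

```hcl
locals {
  # concat() keeps argument order and keeps duplicates:
  # ["a", "b", "b", "c", "d"]
  ordered = concat(["a", "b"], ["b", "c"], ["d"])

  # distinct() drops repeats and keeps the first occurrence of each value:
  # ["a", "b", "c", "d"]
  ordered_unique = distinct(concat(["a", "b"], ["b", "c"], ["d"]))
}
```

Note that concatenating subteam A followed by subteam B only gives the right schedule if that happens to be the rotation order you want.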
Conclusion
While it's great to be DRY and avoid the repetition of values - that shouldn't get in the way of functionality. With regards to Terraform:
If ordering matters in a list, don't use setunion().
Especially if you're setting up a PagerDuty schedule, just "hardcode" / manually specify the rotation order.
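As a rough sketch of what that could look like (an assumption on my part, not the final code from this incident): keep the rotation as its own hand-written list and point the schedule at it. The local name my_team_rotation and the order shown are hypothetical.

```hcl
locals {
  # Hypothetical local: the rotation order, written out by hand.
  # A list (not a set), so Terraform preserves this order.
  # The order below is illustrative only.
  my_team_rotation = [
    pagerduty_user.thilina_ratnayake.id,
    pagerduty_user.teammate_a.id,
    pagerduty_user.teammate_b.id,
    pagerduty_user.teammate_d.id,
    pagerduty_user.teammate_c.id,
  ]
}

resource "pagerduty_schedule" "myteam_schedule" {
  name      = "My Team"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "weekday"
    rotation_turn_length_seconds = 1209600 # 14 days
    rotation_virtual_start       = "2023-01-01T09:00:00-08:00"
    start                        = "2023-01-01T09:00:00-08:00"
    users                        = local.my_team_rotation
  }
}
```

Team membership has no inherent order, so the for_each over the per-subteam sets can stay as it is; only the schedule needs the ordered list.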
Written by
Thilina Ratnayake
I'm a developer from Vancouver, BC who's had an interesting journey in tech, starting from support and moving through cloud infrastructure and project management. Currently I work as an SRE at Lightstep, helping build and "operationalize" things that help guide others towards better o11y :)