Imagine inheriting a 25-year-old project weighed down by technical debt and a team exhausted from constant firefighting. That was my reality in 2020 when I joined a new company to take over a project that upper management had labeled "stuck."

Years of mergers, acquisitions, and losing domain experts had left the project in disarray. Outdated tools made even the simplest tasks difficult. Yet, despite these challenges, my team and I turned things around. The project runs smoothly today, and we take pride in what we’ve built. But getting here was no easy feat — it was a journey full of lessons about managing burnout, technical debt, and teamwork.

One of my first challenges was dealing with the crushing workload on certain team members. Many were stuck in "firefighting" mode, unable to focus on structured, strategic work. This not only burned them out but also held the entire project back.

In this article, I'll explain how we helped one overwhelmed team member regain control. I also wrote another piece about how to help the whole team.

The Signs of Trouble

My first day started with a flood of emails about the same issue: our logging system was down, triggering nonstop alerts that annoyed the team and a dozen other managers.

What happened? Our DevOps engineer (let’s call him Andrew) was updating our logging tool, Graylog, when the Kafka mirroring cluster on production went down. Andrew, the only person who knew how to fix it, had to drop everything to resolve the issue, leaving the Graylog update unfinished and still firing alerts.

It didn’t take long to see that Andrew was under immense pressure. During our first one-on-one, he admitted, "I’m feeling a bit worn out." He constantly bounced between crises, with no time to address root causes or make lasting improvements.

Worse, Andrew left us with a bus factor of one. If he decided to leave one day (and being overwhelmed made that more likely), the whole project would be in serious trouble.
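A minimal sketch of how a bus factor of one can be made visible, assuming you keep a simple map of tasks to the people who can perform them. The task names, the second teammate, and the `WHO_KNOWS` structure are hypothetical illustrations, not data from the project.

```python
from collections import defaultdict


def single_points_of_failure(who_knows: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each person to the tasks that only they can perform."""
    solo = defaultdict(list)
    for task, people in who_knows.items():
        if len(people) == 1:
            (only_person,) = people  # unpack the single member of the set
            solo[only_person].append(task)
    # Sort task lists so the result does not depend on input order.
    return {person: sorted(tasks) for person, tasks in solo.items()}


# Hypothetical example data: task -> people who know how to do it.
WHO_KNOWS = {
    "restore Kafka mirror": {"Andrew"},
    "upgrade Graylog": {"Andrew"},
    "run a deployment": {"Andrew", "Teammate B"},
}

solo_tasks = single_points_of_failure(WHO_KNOWS)
print(solo_tasks)
```

Any task listed under a single name is a place where the team stops if that person is unavailable.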

This was a typical anti-pattern, and the reason people say “DevOps is a culture, not a role.” Relying too heavily on one person leads to a fragile, unsustainable system.

Hiring another DevOps engineer seemed like the simple answer, but we had strict staffing limits. Even if we could have hired someone, it wouldn’t have fixed the core issue: Development and Operations were working in silos despite being part of the same team.

Diagnosing a Root Cause: a Pile of Unfinished Work

So, why was Andrew so overwhelmed?

The root issue was a mountain of unfinished work—quick fixes and half-implemented solutions that had built up over the years. These incomplete tasks required constant maintenance, catching Andrew in a vicious cycle of firefighting.

Figure: Vicious cycle of firefighting

Take the Kafka mirroring cluster mentioned above, for example. It crashed three times in my first two weeks. After the third crash, I reached out to Andrew to find a systemic solution, and 15 minutes later, we realized we could move two services to the primary data center, making the Kafka mirror unnecessary. Years ago, this mirror had been introduced as a temporary solution while migrating some services to the primary data center for GDPR compliance. But like many temporary solutions, it stuck around indefinitely.

Another example was our messy production setup. We had four different deployment methods: RPM installations on VMs, Docker, self-managed Kubernetes, and AWS. Only Andrew knew how to manage all of them. Over the years, these tools had been introduced but never fully adopted, creating a "Lava Flow" anti-pattern: an incomplete, messy setup that hardened into a permanent fixture. We simplified deployments by standardizing on Docker and cutting 70% of our Jenkins code, which reduced cognitive load and allowed the whole team to become proficient in our delivery pipelines.

And, of course, there were plenty of undocumented tasks that only Andrew knew how to do.

The pattern was clear: unfinished work was both the cause and the effect of Andrew’s overload.

The Solution: Identifying and Distributing Workload

The solution was obvious:

Identify all unfinished tasks.
Spread responsibilities across the team to reduce reliance on a single person.
Free up Andrew’s time for high-value work instead of constant interruptions.
Ensure the team can operate smoothly even when Andrew is away.

This was easier said than done, starting with identifying the unfinished work. When someone is overwhelmed, it’s hard to see the whole picture and identify long-term solutions. Andrew could only recall the latest fires he had put out, a common experience for those stuck in reactive mode.

I turned to David Allen’s Getting Things Done method and its “trigger list” technique. This method helps to offload mental clutter and identify everything demanding attention. For us, it was a way to detect unfinished work.

I customized a trigger list for Andrew, adding all our services and tools.

Figure: Trigger list inspired by David Allen’s Getting Things Done (GTD) methodology

Then, I asked Andrew to:

Set aside a few hours without distractions—no phone, no messaging apps.
Go through the list of triggers and jot down any tasks that come to mind. Don’t try to organize them; aim for quantity, not quality. Refining comes later.
When your mind feels clear, take a break. Avoid doing mental work—go for a walk or have a meal instead. More tasks will likely come to mind, and they should be written down, too.

Andrew was on board with the idea since he felt overwhelmed by both work and personal matters. We agreed he’d go through the exercise and share the work-related part with me so we could tackle it together.

Unsurprisingly, he struggled to find the time, so I suggested he take two days off solely to create the list. The next day, he returned with a stack of notes that we eventually turned into 200 tasks!

We broke that massive list into manageable categories and identified an approach for each:

Teach: Many tasks, like restoring Kafka, weren’t difficult — just undocumented. We started informal "morning coffee" sessions where we broke production and Andrew guided us through hands-on exercises. We recorded each session, creating a reference library for the future.
Document: We had almost no documentation for our day-to-day maintenance routines. Instead of forcing full documentation upfront, we wrote instructions as tasks arose. Then another teammate would follow the steps from the doc and refine it, ensuring clarity and usability.
Do: We divided tasks into three categories: those only Andrew could do, those requiring his guidance, and those the team could handle independently. Interestingly, some of Andrew’s routine tasks were exciting challenges for others. We assigned tasks based on each person’s interests so that everyone could contribute meaningfully and learn.
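The Teach / Document / Do split, with the three ownership levels under "Do", can be sketched as a small data model. This is only an illustration under assumptions: the class names, the `TASKS` entries, and the helper functions are hypothetical, and a spreadsheet or issue tracker with labels would serve the same purpose.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Approach(Enum):
    TEACH = "Teach"
    DOCUMENT = "Document"
    DO = "Do"


class Owner(Enum):
    ANDREW_ONLY = "only Andrew can do it"
    WITH_GUIDANCE = "team does it with Andrew's guidance"
    TEAM = "team handles it independently"


@dataclass(frozen=True)
class Task:
    title: str
    approach: Approach
    owner: Optional[Owner] = None  # only meaningful when approach is DO


def group_by_approach(tasks: list[Task]) -> dict[Approach, list[str]]:
    """Group task titles by the approach chosen for them."""
    grouped = defaultdict(list)
    for task in tasks:
        grouped[task.approach].append(task.title)
    return dict(grouped)


def still_andrew_only(tasks: list[Task]) -> list[str]:
    """List the 'Do' tasks that still depend on a single person."""
    return [t.title for t in tasks
            if t.approach is Approach.DO and t.owner is Owner.ANDREW_ONLY]


# Hypothetical sample entries, loosely modeled on the examples in the text.
TASKS = [
    Task("Restore Kafka after a crash", Approach.TEACH),
    Task("Write down a maintenance routine", Approach.DOCUMENT),
    Task("Rework the delivery pipelines", Approach.DO, Owner.TEAM),
    Task("Example task only Andrew can do", Approach.DO, Owner.ANDREW_ONLY),
]

print(group_by_approach(TASKS))
print(still_andrew_only(TASKS))
```

Watching the `still_andrew_only` list shrink over time is one way to track whether the redistribution is working.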

The Results

Over the next few months, we redistributed Andrew’s workload. For example, I took over our delivery pipelines, reworking them and using DORA metrics to track improvements in delivery speed and stability.

Our litmus test was whether Andrew could take a vacation without getting emergency calls from us.
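For readers unfamiliar with DORA metrics, here is a minimal sketch of two of the four (deployment frequency and change failure rate), assuming you can export a list of deployments with a flag for whether each one caused a failure. The `Deployment` record, the dates, and the numbers are hypothetical; they are not our actual figures.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deployment:
    day: date
    caused_failure: bool  # did this change lead to an incident or rollback?


def deployments_per_week(deployments: list[Deployment], period_days: int) -> float:
    """Deployment frequency: average number of deployments per 7 days."""
    return len(deployments) * 7 / period_days


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that caused a failure (0.0 if there were none)."""
    if not deployments:
        return 0.0
    failures = sum(d.caused_failure for d in deployments)
    return failures / len(deployments)


# Hypothetical two-week sample.
DEPLOYMENTS = [
    Deployment(date(2021, 3, 1), False),
    Deployment(date(2021, 3, 3), True),
    Deployment(date(2021, 3, 8), False),
    Deployment(date(2021, 3, 10), False),
]

frequency = deployments_per_week(DEPLOYMENTS, period_days=14)
failure_rate = change_failure_rate(DEPLOYMENTS)
print(f"{frequency:.1f} deployments/week, {failure_rate:.0%} change failure rate")
```

The other two DORA metrics, lead time for changes and time to restore service, need timestamps for commits and incidents and are left out of this sketch.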

It took nearly a year, but we got there. The project became significantly more stable, and the team gained the confidence to handle long-standing issues independently. We fully embraced Continuous Integration and Delivery (CI/CD), transitioned to Infrastructure as Code (IaC), and improved our observability tools. DevOps was no longer just Andrew’s responsibility—it became embedded in our team culture.

The impact went beyond surface-level adjustments. We delivered real, measurable results:

Incident frequency dropped: We used to have 1–2 urgent incidents per week. Within a year, this was down to one every 2–3 months. That let us focus on long-term goals and meet expectations more reliably, restoring the business’s trust in our commitments.
Proactive risk mitigation: Previously, we mostly dealt with problems only after they blew up, such as scrambling to fix critical vulnerabilities at the last minute or updating an outdated OS only after the vendor removed the image. A year later, we were handling most risks ahead of time, preventing disruptions before they occurred.
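To put the incident numbers in perspective, the drop can be worked out as a yearly rate. This is back-of-the-envelope arithmetic on the ranges quoted above, assuming 52 weeks and 12 months per year; the variable names are only for illustration.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12

# Before: 1-2 urgent incidents per week.
before_per_year = (1 * WEEKS_PER_YEAR, 2 * WEEKS_PER_YEAR)

# After: one incident every 2-3 months (low end = every 3 months).
after_per_year = (MONTHS_PER_YEAR / 3, MONTHS_PER_YEAR / 2)

# Most conservative and most optimistic reduction factors.
min_reduction = before_per_year[0] / after_per_year[1]
max_reduction = before_per_year[1] / after_per_year[0]

print(f"Before: {before_per_year[0]}-{before_per_year[1]} incidents per year")
print(f"After: {after_per_year[0]:.0f}-{after_per_year[1]:.0f} incidents per year")
print(f"Reduction: roughly {min_reduction:.1f}x to {max_reduction:.0f}x")
```

Even on the conservative end, that is close to an order of magnitude fewer interruptions per year.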

Conclusion

Burnout and over-reliance on a single person are signs of an unhealthy work environment, not individual failings. The cause is often a pile of unfinished work that repeatedly disrupts the team.

If some of your teammates are in a similar situation, take action:

Identify and prioritize unfinished tasks.
Distribute responsibilities across the team.
Foster a culture of knowledge sharing.
Ensure the team can function smoothly, even in the absence of key members.

This transformation won’t happen overnight, but each step makes a difference. Aim for the ultimate sign of success: you and your team can take a well-earned break without emergency calls.

Life is a lot better when the fires stop burning.

Fedor

Relying on Heroes Is Not a Strategy: Overcoming Teammate Overload


Fedor Shchudlo
