A Rewarding Turnaround

Jason Vertrees

PART 1 — The Inheritance: Picking Up the Burning Baton

I don’t usually get called into situations when everything is running smoothly. My phone rings when things are broken. And that’s fine by me. That’s where the interesting work and learnings are.

This is a story of one of those times.

When I stepped into this project, I inherited a team that was, for lack of a better word, leaderless. There were people with fancy titles like Director of This and Chief of That, but there wasn’t much leadership happening. The Director flat-out told me he didn’t see the dozens of engineers around him—who showed up five days a week, wrote code, shipped (or attempted to ship) product—as his team. The Chief of That was busy protecting turf, undermining trust, and avoiding responsibility. But, I was told, we're just a few weeks from launching our new product. It's going well.

These two had managed to create what I can only call a "performance theater." On paper, everything looked fine: roadmaps existed, backlogs were groomed, standups were happening, and the dashboards were green. But scratch beneath the surface and you were neck-deep in dysfunction.

During my first week, I was asked to deliver a performance improvement plan (PIP) to an engineer I had never met—who, by the way, was out on leave. There was no documentation justifying the PIP. None. Turns out, this PIP had been discussed extensively in large, highly inappropriate circles, but never actually communicated to the person it targeted. Months earlier, that engineer had caught wind of it through gossip—torpedoing any chance of a constructive turnaround. I'm speculating here, but I think their hope was that he'd find out and leave to avoid the uncomfortable discussion. Hope as a strategy did not work here.

More baleful, though, was that the leadership's style, if you can call it that, was to manufacture accountability optics without doing the hard work of leadership: setting expectations, providing coaching, and holding people accountable in a direct, fair, and timely way. It was dysfunction disguised as process. It was false transparency. It created so much waste.

Process-wise, the team had heard of agile. They knew the ceremonies. They held standups. Stories moved across boards. They were performing agile the way a cargo cult performs ritual: going through the motions, hoping the desired outcomes would magically materialize. Some teams were better than others, but even the good ones struggled with fundamentals. On the surface, they could say, "Look! We do agile!" But they weren’t delivering value. They weren’t learning. They weren’t improving. The rituals became a shield against scrutiny rather than a means of execution.

Operationally, things were broken in two very different ways. First, I saw massive pull requests: 3,300-line PRs submitted by some of the most junior engineers. GitHub struggled to show the diff. I was aghast. But I kept my poker face to observe how things played out. These PRs were reviewed and approved with minimal scrutiny. And this wasn't a one-off. Giant, unreviewable PRs were crafted on the daily.

The second failure mode was almost the opposite. Another team was stuck in a delivery pipeline that barely functioned. They couldn't test code locally. The only way to know if something worked was to push a change and wait for the pipeline—which would often fail for reasons entirely unrelated to their change—to see how prod reacted. One failed push clogged the entire deployment pipeline, backing up every PR queued behind it. Their only option was to submit endless micro-PRs, often single-line changes with commit messages like "test the thing."

Both paths led to the same result: technical debt, fear of change, and system fragility. The first team lacked technical mentorship and review discipline. The second was paralyzed by a broken toolchain.

Meanwhile, the system was on fire around the clock. PagerDuty alerts were firing 120 times a month. That turned their Pager Duty into Pager Call of Duty: Modern Nightmare. Engineers dreaded being on-call because, when things broke—and they broke constantly—nobody knew how to fix them. The most common remediation pattern? Wait. Literally wait. Many issues were transient and would mysteriously resolve themselves. And when you design a system that behaves this way, you inadvertently train your team to hope that problems simply disappear. It was operational learned helplessness. It sucked.

The cherry on top: the dashboards were green. Not uptime dashboards, mind you. Just deployment dashboards: "Did the function deploy successfully? Yes? Great!" Never mind that it didn’t work once deployed. Never mind that the data models were inconsistent. Never mind that customer-facing functionality was broken or brittle. It was the perfect mirror of the culture: optimized for appearances, allergic to truth.

This was the state of things when I arrived. But it gets worse before it gets better.

PART 2 — Architectural Hubris and the Birth of the Distributed Monolith

The distributed monolith didn’t happen by accident. It was engineered, intentionally, and with great confidence. They started, as so many do, by grabbing the latest industry reports and adopting them wholesale, without consideration or understanding. Enter The State of DevOps report. It identified certain practices that high-performing companies tend to follow. Things like trunk-based development, microservices, and serverless architecture. All good things—in the right hands, at the right time, for the right reasons.

But they fell into a classic logical trap. The report said, in effect: high-performing companies do X. So they reasoned: if we do X, we will become high-performing. That’s not how logic works.

Fun Sidebar. If you haven’t studied formal logic, let me give you the quick version. In logic, we write P → Q, which means: if P is true, then Q is true. If it rains, the grass gets wet. But that doesn’t mean that if the grass is wet, it must have rained. Maybe the sprinkler ran. Maybe it was a slip-and-slide. Q (wet grass) can happen for many reasons besides P (rain). But they assumed the reverse: that by copying behaviors, they’d automatically achieve outcomes. They thought they were proving P ⇔ Q. (That is read, "P if and only if Q," the biconditional.) They weren’t.
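For those who like it in symbols, here's the same point as a quick sketch:

```latex
% Valid inference (modus ponens): from P -> Q and P, conclude Q.
\[ P \to Q,\quad P \;\;\vdash\;\; Q \]
% Invalid inference (affirming the consequent): from P -> Q and Q, you cannot conclude P.
\[ P \to Q,\quad Q \;\;\nvdash\;\; P \]
% The report established, roughly, "high-performing -> these practices";
% the team acted as if it said "these practices -> high-performing", i.e. P <-> Q.
```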

Fun, right?

The broken culture, with this report as fodder, became the nucleation point for the poorly designed architecture: the distributed monolith. It wasn't working for them, and they knew it. They argued constantly about architecture but never made decisions or moved forward. In fact, this is where I coined Jason's Law, my own little corollary:

Jason's Law: As any technical debate grows longer, the probability of someone invoking Conway’s Law approaches one.

And sure enough, within days of my arrival, Conway’s Law was being thrown around like a magical incantation. The org chart is the architecture! They said it with such confidence. But they were using it not as a diagnosis but as a defense.

The architecture? The result of this thinking was, quite literally, 120 GitHub repositories. Not 120 services. 120 CRUD endpoints masquerading as microservices. User-READ was a repo. User-WRITE was a repo. User-CREATE and User-DELETE each had their own repo. Each of these tiny repos owned its own notion of what a User was, because there was no shared schema, no central data model, no relational database. Just a document store with scattered, conflicting definitions smeared around like peanut butter.

Worse, every one of these repos deployed as its own serverless function. Each was fragile. Each was independent in name only, but deeply intertwined in practice. They had invented, through sheer cargo-cult enthusiasm, a distributed monolith—harder to operate, harder to debug, and utterly brittle under load. Yet, they insisted they were doing microservices because, well, they had read that microservices were good.

In parallel, they stumbled into a separate flavor of configuration hell. The Twelve-Factor App recommends strict separation of config from code. Instead, they jammed 3,000 lines of environment-specific conditionals into a single config file: if prod, do this; if staging, do that; if dev, do something else entirely. It was barely parseable by humans, let alone safe for machines. Don't get me wrong: I'm a big fan of the Twelve-Factor App methodology. But make sure you implement it correctly. I can still see, in my mind's eye, the Slack message from one of my favorite contractors upon opening that file:

OMG. This file.
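For contrast, here's a minimal sketch of what twelve-factor-style config can look like (the setting names and file layout are illustrative, not from that codebase): the code reads its values from the environment and never branches on which environment it's running in.

```python
# config.py -- illustrative twelve-factor-style configuration.
# Environment-specific values come from the environment itself;
# the code never asks "am I in prod, staging, or dev?"
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    queue_url: str
    log_level: str = "INFO"


def load_settings() -> Settings:
    """Build settings from environment variables, failing fast if a required one is missing."""
    return Settings(
        database_url=os.environ["DATABASE_URL"],
        queue_url=os.environ["QUEUE_URL"],
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
    )
```

Each environment then supplies its own DATABASE_URL and QUEUE_URL at deploy time, and the exact same artifact runs everywhere. There is nothing for a human to parse and no per-environment code path to get wrong.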

The entire system wobbled and flapped constantly. But to the untrained eye—or the willfully blind manager—everything appeared fine. Green lights. Successful deploys. Nobody wanted to scratch beneath the surface.

The infrastructure wasn’t the only thing wobbling. The culture itself was trapped in a self-reinforcing downward cycle. Because problems appeared intermittently and self-resolved often enough, the team became conditioned to wait. When a deployment broke, when a service failed, when PagerDuty screamed at 2 a.m., the response was rarely root cause analysis. It was: let’s see if it fixes itself. More often than not, it did. But that survival tactic slowly eroded any sense of ownership, any drive to understand the system. Debugging turned into gambling.

I realized that what looked like isolated technical problems were really symptoms of deep cultural failures. The architecture wasn’t fragile by accident. It was fragile because leadership rewarded people for appearing productive, not for building durable systems. The cargo cult adoption of DevOps principles wasn’t a random misstep; it was the byproduct of leadership chasing buzzwords instead of building competence. The 120 repositories weren’t created by malicious intent; they were born from sincere but misguided attempts to emulate high-performing companies without understanding why those companies made the choices they did.

It wasn’t bad people. It was bad leadership. It was bad logic. And left unchecked, it nearly guaranteed technical collapse.

Oh, and when I started I was told, "We're really close to going to production." No. No, you are not. Your document store should not be doing 300k reads per second under minimal traffic. You should not be writing on every read. You built something you cannot understand. It needs to be much simpler. You need observability. You need to understand its operational characteristics.

PART 3 — The Turnaround

The cultural portion of the turnaround is full of great lessons, but I'll save it for another post. Like many turnarounds, it was pretty straightforward. Folks who live in a mess know they live in a mess. They don't like it. They want the change. They complain to their partners after work. They usually know where the pain is coming from. So observing, listening, and doing the obvious usually goes a long way. But the key lesson here is: buy-in and alignment make sweeping changes much easier because you're not fighting against the team. Small changes in behavior from the top obviate many needless fights. By the time I get going, the team already sees the issues and has clear ideas on how to fix them.

Turning technology around didn’t start with big proclamations or sweeping change management initiatives. I dribbled in new cultural expectations as mentioned above. We started small, almost boringly small. The first order of business was to give the engineers a place where they could actually write code, test it, and see what it did—locally, on their machines, without having to deploy to production just to validate a change. This is the power of a solid developer experience. As a talented engineer once told me, "they need to know that the code they just wrote will do what they expect it to."

The repo sprawl had to end. With the help of a very talented senior consultant, we consolidated those 120 repositories into a single, unified monorepo. We preserved git history. We cleaned up tons of duplicated and conflicting domain models. Code that had been scattered across dozens of places, each with its own slightly different understanding of what a User or Order or Transaction was, finally came together into shared models with real integrity. Consider what dependency version pinning looks like across 120 disparate repos that purportedly all exercised the logic of a single domain. Once everything coalesced into a single monorepo, we had to untangle that mess. But we did, and gigabytes of duplicated dependencies were removed.
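To make "shared models with real integrity" concrete, here's a hypothetical sketch (the fields are invented for illustration): one canonical definition that every service in the monorepo imports, instead of dozens of slightly different ones.

```python
# domain/user.py -- hypothetical canonical User model shared across the monorepo.
# Before consolidation, each of the 120 repos carried its own, slightly different
# idea of a User; after, every service imports this single definition.
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class User:
    user_id: str
    email: str
    created_at: datetime
    is_active: bool = True
```

Any service that needs a User now writes `from domain.user import User`: one import, one schema, no drift.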

We could then build a sane local development environment. This one change immediately broke the dam. Engineers could finally test real functionality before committing code. The micro-PR storm slowed. Instead of submitting endless one-line changes labeled "test the thing," they could work in logical units. Larger, coherent pull requests started flowing in—not 3,000-line monsters, but healthy, reviewable bodies of code. Fixing local development paved the way for better reviews, better tests, and, critically, lower anxiety. The on-call nightmare faded. PagerDuty alerts dropped from 120 incidents a month down to single digits. Suddenly, the on-call schedule was no longer feared. People could sleep through the night.

Git branch protections went into place. No more force-push disasters where someone accidentally nuked main with git push -f. Tests were written. Pipelines stabilized. Deployments became predictable instead of terrifying.
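As a flavor of what putting branch protections in place can look like when automated, here's a sketch against the GitHub REST API (the org, repo, and status-check names are placeholders, and the payload reflects the API as I've used it rather than this team's exact settings):

```python
# protect_main.py -- illustrative sketch: require PR reviews and passing checks
# on main, and block force-pushes and deletions. Assumes a GITHUB_TOKEN with
# admin rights on the repo; owner/repo/check names below are placeholders.
import os

import requests

OWNER, REPO, BRANCH = "example-org", "example-monorepo", "main"

payload = {
    "required_status_checks": {"strict": True, "contexts": ["ci/tests"]},
    "enforce_admins": True,
    "required_pull_request_reviews": {"required_approving_review_count": 1},
    "restrictions": None,
    "allow_force_pushes": False,  # no more accidental `git push -f` on main
    "allow_deletions": False,
}

resp = requests.put(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json=payload,
    timeout=30,
)
resp.raise_for_status()
```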

Most importantly, the culture started to heal. People began taking ownership. They weren’t afraid of their own system anymore. They understood it. They could change it confidently. I love this part. People see that they can advance in their career; they step up and, on their own, choose to take on more. To contribute to something larger than themselves.

We didn’t fix everything overnight. But we re-established a foundation—one that was rooted in clear technical thinking, honest leadership, and a real commitment to engineering maturity. From there, real growth could finally begin.

The deeper lesson here is not about code or architecture or even process. It's about leadership, critical thinking, discernment, and saying, "yeah, that best practice works for that two-trillion-dollar company and my friends talk about it, but maybe we shouldn't do that (yet)." You can't cargo cult your way into excellence. You can't copy the observable behaviors of high-performing companies and assume their success will transfer. You have to understand why they made those choices, what tradeoffs they were managing, and what context they were operating within. Without that, you're imitating shadows.

The business results were substantial: a happier team, faster and more reliable delivery, and lower human and infrastructure costs.

Leaders must resist the temptation to substitute jargon for judgment. Buzzwords don’t build systems. Thoughtful decisions do. Healthy organizations aren’t powered by standups or microservices or serverless functions. They’re powered by teams who understand their systems, trust one another, and are accountable for delivering value.

I write this not to mock others but to share some of the challenges and learnings I've seen. In a follow-up post, I'll focus more clearly on what I learned as opposed to what I did.
