⏰The Time Lapse Bug | A Glitch Worse Than Y2K⏳
Many moons ago, I stumbled upon a curious field notice from Cisco. It detailed how an MDS switch would mysteriously reboot after exactly 233 days of operation. Now, picture the IT department’s reaction when this unexpected reset happens. "Koos! Did you trip over the power cord again?" "Boss, I swear, I did nothing, I was having a smoke outside!" The kicker? This wasn't Cisco's first rodeo with time lapse bugs. They've had a history of similar issues, like the one reported in Network World. Unlike the overhyped Y2K bug, these time lapse bugs are very real and can create significant outages.
Where Testing Comes Up Short | The Good Old Soak Test
Traditional testing methodologies often fall short when it comes to time lapse bugs. These are the kinds of bugs that only reveal themselves after extended periods of operation, which is something most testing environments aren't set up to catch. The best defense? Redundant systems and a rigorous, sequential updating process. It might seem obvious, but I’ve seen techies make the fatal error of patching all redundant systems simultaneously. When asked why they were taking such a risk, I was told it was to minimize downtime. Sure, that strategy might make sense in terms of saving time, but it completely disregards the risk of time lapse bugs: systems that are patched and rebooted together start their uptime clocks together, so an uptime-triggered bug will hit all of them at the same moment. The fallout from a major incident due to a time lapse bug makes it clear that updates should always be done sequentially.
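As a rough illustration of what "sequential" means in practice, here is a minimal Python sketch. The function name `patch_sequentially` and the `apply_patch` and `is_healthy` callbacks are hypothetical placeholders for whatever your own tooling provides; this is not any vendor's procedure. The idea is simply to touch one node at a time and to stop the moment a patched node fails its check, so the remaining nodes keep running the old, known-good code.

```python
from typing import Callable, Iterable, List


def patch_sequentially(
    nodes: Iterable[str],
    apply_patch: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Patch one node at a time; stop at the first node that fails its check.

    Returns the nodes that were patched and then verified healthy.
    """
    patched: List[str] = []
    for node in nodes:
        apply_patch(node)
        if not is_healthy(node):
            # Halt the rollout: the remaining nodes are left untouched.
            break
        patched.append(node)
    return patched


if __name__ == "__main__":
    touched: List[str] = []
    verified = patch_sequentially(
        ["switch-a", "switch-b"],        # hypothetical node names
        apply_patch=touched.append,      # stand-in for the real patch step
        is_healthy=lambda node: True,    # stand-in for a real health check
    )
    print("touched:", touched, "verified:", verified)
```

In a real rollout the health check would presumably include a waiting period between nodes. No realistic wait will expose a bug that needs months to surface, but it does guarantee that the redundant nodes never share the same boot time.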
The Bug That Kills Redundancy
Imagine you're in a data center, feeling confident that your redundant systems will protect you from downtime. Then, BAM! A time lapse bug triggers a reboot on one system. No worries, right? You've got redundancy. But wait—what's this? The second system is also rebooting! "Don't worry, mate. We’ve got two of those. What the heck! The second one is also rebooting! RED ALERT! SEV 1! WAKE THE NEIGHBORHOOD UP!" The risk here is compounded if your updates result in version incompatibility, which could prevent the redundant systems from maintaining their state. In this case, it’s better to temporarily operate with limited (or no) redundancy than to risk a catastrophic simultaneous failure due to a time lapse bug.
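To put some numbers on that scenario, the sketch below works out when an uptime-triggered bug would fire on each node and whether the resulting outages would overlap. It assumes the bug fires after a fixed 233 days of uptime, as in the MDS story above; the node names, the recovery time and the helper functions are all made up for the illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Assumption: the bug fires after a fixed amount of uptime (233 days here).
BUG_UPTIME = timedelta(days=233)


def bug_trigger_times(boot_times: Dict[str, datetime]) -> Dict[str, datetime]:
    """Return when each node would hit the bug, given its last boot time."""
    return {node: boot + BUG_UPTIME for node, boot in boot_times.items()}


def simultaneous_failures(
    boot_times: Dict[str, datetime], recovery: timedelta
) -> List[Tuple[str, str]]:
    """List consecutive node pairs (in trigger order) whose outages overlap.

    `recovery` is how long a node is assumed to stay down after the bug fires.
    """
    triggers = sorted(bug_trigger_times(boot_times).items(), key=lambda kv: kv[1])
    overlaps: List[Tuple[str, str]] = []
    for (first, t_first), (second, t_second) in zip(triggers, triggers[1:]):
        if t_second - t_first < recovery:
            overlaps.append((first, second))
    return overlaps


if __name__ == "__main__":
    patch_night = datetime(2024, 1, 1, 2, 0)
    recovery = timedelta(minutes=10)

    together = {"core-1": patch_night, "core-2": patch_night}
    staggered = {"core-1": patch_night, "core-2": patch_night + timedelta(days=7)}

    print("patched together:", simultaneous_failures(together, recovery))
    print("patched a week apart:", simultaneous_failures(staggered, recovery))
```

Patched together, both nodes fall over in the same ten minutes. Patched a week apart, the first reboot is a warning shot and the second node carries the load while you work out what happened.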
The Cow That Only Lived 256 Days
This tale reminds me of another time lapse bug I encountered years ago with the Madge SmartCau Plus, a token-ring network hub. The hub had a nasty habit of locking up after exactly 255 days of uptime. Why? The developer had used a byte data type to record the number of days of uptime, but the program logic treated it as a word data type. A byte can only count up to 255, so when day 256 rolled around, the variable overflowed, and KAPOW! We had customers perform controlled reboots as a workaround, buying us 255 days to load new, bug-free code.
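The wraparound itself is easy to reproduce. The Python sketch below is only an illustration of the arithmetic, assuming an unsigned 8-bit counter; I never saw the hub's firmware source, so the function names and the comparison with a 16-bit word are mine.

```python
def uptime_days_as_byte(days_elapsed: int) -> int:
    """Uptime in days as an unsigned 8-bit byte would store it (0..255)."""
    return days_elapsed & 0xFF


def uptime_days_as_word(days_elapsed: int) -> int:
    """Uptime in days as an unsigned 16-bit word would store it (0..65535)."""
    return days_elapsed & 0xFFFF


if __name__ == "__main__":
    for day in (254, 255, 256, 257):
        print(
            f"day {day}: byte={uptime_days_as_byte(day)} "
            f"word={uptime_days_as_word(day)}"
        )
    # day 254: byte=254 word=254
    # day 255: byte=255 word=255
    # day 256: byte=0 word=256
    # day 257: byte=1 word=257
```

Any logic that expects the counter to keep climbing past 255 is in for a surprise when the stored value snaps back to zero.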
Lesson Learned | Patch Sequentially
The moral of the story is simple: always patch sequentially. It’s a lesson learned the hard way, but one that can save you from the chaos of a time lapse bug triggering a major incident.
Reality | Super Tuesday Habits. Patch Everything at Once.
Despite the wisdom of sequential patching, many IT teams still opt for the “Super Tuesday” approach—patching everything at once. It's a risky move that ignores the very real dangers posed by time lapse bugs.
The "Clownstrike" Incident | A Variant of the Time Lapse Bug of Epic Proportions
Time lapse bugs can be sneaky, but sometimes they turn into full-blown disasters. One such incident, dubbed "Clownstrike," is a prime example of how a seemingly minor oversight can spiral into the largest IT outage in history. This catastrophe was triggered by an update pushed by a well-known cybersecurity firm, CrowdStrike.
Here’s what went down: the update included 21 processes (strictly speaking, 21 input fields in a content template), but the code had only been designed and tested to handle 20. That’s right: just one unaccounted-for item was all it took to bring entire systems to their knees. The result? A worldwide IT outage that impacted countless organizations, all because of a hidden bug that slipped through the cracks.
At least the Madge SmartCau Plus had the excuse of being limited to a byte—our "Clownstrike" friends couldn't even count higher than the fingers and toes on a human. The lesson here is clear: even the most advanced systems are vulnerable to the simplest mistakes, especially when time lapse bugs and their variants are involved.
Counting Matters | The Devil's in the Details
The "Clownstrike" incident highlights a critical truth about time lapse variant bugs: they're often born out of overlooked details. In this case, the oversight was in the number of processes. It’s a stark reminder that in the world of IT, counting matters. Whether it's 21 processes or 256 days of uptime, missing the mark can lead to catastrophic consequences. The incident serves as a cautionary tale for IT professionals everywhere: always double-check your numbers, and remember that even a small discrepancy can lead to a monumental failure.
Prevention | Testing, Counting, & Caution
The best way to avoid falling into the "Clownstrike" trap is to ensure rigorous testing and validation. If your system is designed for 20 processes, make sure it can handle that 21st one, or better yet, build in a buffer. And always, always count higher than the fingers and toes on your hands and feet. The difference between 20 and 21 may seem trivial, but as "Clownstrike" proved, it’s anything but.
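Here is a minimal Python sketch of that kind of count check, assuming a consumer that expects 21 values and a caller that only supplies 20. The names are hypothetical and this is not the vendor's code. Python raises an `IndexError` where lower-level code might read past the end of the array and crash, but the principle is the same: validate the count before you index.

```python
from typing import Sequence

EXPECTED_INPUTS = 21  # hypothetical: what the new definition refers to


def read_input_unchecked(inputs: Sequence[str], index: int) -> str:
    """Index straight into the inputs, trusting that enough were supplied."""
    return inputs[index]


def read_input_checked(
    inputs: Sequence[str], index: int, expected: int = EXPECTED_INPUTS
) -> str:
    """Check the count first, so a mismatch fails early with a clear error."""
    if len(inputs) != expected:
        raise ValueError(f"expected {expected} inputs, got {len(inputs)}")
    return inputs[index]


if __name__ == "__main__":
    supplied = [f"value-{i}" for i in range(20)]  # only 20 values supplied

    try:
        read_input_unchecked(supplied, 20)  # asks for the 21st value
    except IndexError as err:
        print("unchecked read failed:", err)

    try:
        read_input_checked(supplied, 20)
    except ValueError as err:
        print("checked read refused:", err)
```

The checked version still fails, but it fails with a message that says exactly what is wrong, at a point where the caller can reject the update instead of taking the whole system down.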
In the end, the "Clownstrike" incident wasn’t just a time lapse bug—it was a wake-up call. It showed that even the most sophisticated IT environments are susceptible to simple mistakes. So, take heed: when it comes to updates and patches, leave no stone unturned, and no process uncounted.
Warning | Major Incident Tsunamis Are Inevitable
Time lapse bugs are like ticking time bombs. If you're not careful, they’ll cause a major incident tsunami that will wreak havoc on your systems. So, take heed—patch sequentially, keep an eye out for time lapse bugs, and always be prepared for the unexpected.
Ronald Bartels ensures that Internet inhabiting things are connected reliably online at Fusion Broadband South Africa - the leading specialized SD-WAN provider in South Africa.