In Layer 2 Ethernet networks, resilience is key. Unlike Layer 3, where routing protocols dynamically reroute traffic, Layer 2 has limited options when it comes to path redundancy. Enter Link Aggregation (LAG)—a method that allows multiple physical links to be bundled into a single logical link, ensuring both higher bandwidth and resilience against failures.

But here’s the catch: LAG isn’t a set-and-forget solution. Many engineers deploy it, assume it's working, and move on—only to get a nasty surprise when a failure occurs without any alerts. If your LAG fails silently, that’s a monitoring failure, not just a network failure.

Why Use Link Aggregation?

1️⃣ All Links Forward Traffic

One of the biggest advantages of LAG is that, unlike spanning tree (which blocks redundant links), all links in the bundle are actively forwarding traffic. This means:
✅ Increased throughput by combining bandwidth across multiple links
✅ Automatic failover—if a link fails, traffic shifts to the remaining links
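✅ Less complexity than other redundancy methods: the bundle appears as one logical interface, so spanning tree and the rest of the configuration only see a single link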
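To make "traffic shifts to the remaining links" concrete, here is a minimal sketch of how a bundle might spread flows across its members and redistribute them after a failure. It assumes a simple hash over each flow's addresses and ports; real switches use vendor-specific hash inputs and algorithms, and the member names and flows below are made up for illustration.

```python
import zlib


def pick_member(flow, active_links):
    """Map a flow tuple to one active member link using a stable hash."""
    key = "|".join(str(field) for field in flow).encode()
    return active_links[zlib.crc32(key) % len(active_links)]


links = ["eth1", "eth2", "eth3", "eth4"]  # hypothetical member names
# Hypothetical flows: (source IP, destination IP, source port, destination port)
flows = [(f"10.0.0.{i}", "10.0.1.1", 40000 + i, 443) for i in range(8)]

before = {flow: pick_member(flow, links) for flow in flows}

# Simulate eth2 failing: it leaves the bundle and flows are hashed again
# over the three surviving links.
survivors = [link for link in links if link != "eth2"]
after = {flow: pick_member(flow, survivors) for flow in flows}

for flow in flows:
    print(flow[0], before[flow], "->", after[flow])
```

Because each flow is pinned to one member, a single flow can never exceed the speed of one physical link. The extra bandwidth only shows up when there are many flows to spread around.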

How Does LAG Detect Failures?

🔍 1. Basic Link Status (The Default Check)

By default, most switches determine whether a LAG member is active based on link status (up/down). If a cable is physically unplugged or a switchport goes down, the device removes the failed link from the aggregation.

Sounds good, right? Well, not always.

💀 The Problem:
What if the link stays physically up but traffic isn’t actually passing? 🤔

This can happen due to:
❌ One-way traffic failures (e.g., unidirectional fiber failure)
❌ Misconfigured VLANs, preventing some traffic from flowing
❌ Physical layer degradation (e.g., bad optics, high CRC errors)

Link status alone isn’t enough to ensure all LAG members are actually working.

🛠 2. LACP (Link Aggregation Control Protocol)

LACP is the most commonly used dynamic LAG protocol. Instead of assuming a link is working, LACP sends PDUs (Protocol Data Units) down each member link. This helps detect certain failures:
✅ Misconfigured LAG groups (a link plugged into the wrong switch is kept out of the bundle)
✅ Link negotiation issues
✅ Members that don’t belong to the same aggregation group on both ends

⚠️ But LACP isn't fast.
LACP only declares a member failed once PDUs have been missing for a timeout of three PDU intervals. Even at its quickest that takes seconds, which isn’t always fast enough for critical networks.

💡 Pro Tip: You can adjust the LACP timer to improve failure detection (Fast LACP mode sends PDUs every second instead of every 30 seconds).
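As a rough illustration of what the timer change buys you, the sketch below works out how long a partner can stay silent before LACP gives up on it. It assumes the standard IEEE 802.1AX values (PDUs every 30 seconds or every 1 second, with a timeout of three intervals); the function and constant names are just for this example.

```python
# Standard LACP periodic timers in seconds (IEEE 802.1AX).
LACP_PERIOD_S = {"slow": 30, "fast": 1}
# A partner is timed out after three PDU intervals with nothing received.
LACP_TIMEOUT_MULTIPLIER = 3


def lacp_detection_time_s(mode):
    """Return the worst-case time before a silent partner is declared lost."""
    return LACP_PERIOD_S[mode] * LACP_TIMEOUT_MULTIPLIER


for mode in ("slow", "fast"):
    print(f"{mode}: partner timed out after about {lacp_detection_time_s(mode)} s")
```

So slow mode can leave a dead member in the bundle for up to 90 seconds, while fast mode brings that down to about 3 seconds.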

🚀 3. BFD (Bidirectional Forwarding Detection) for Faster Failure Detection

For networks needing sub-second failure detection, BFD (Bidirectional Forwarding Detection) is the go-to solution.

Unlike LACP, BFD:
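✅ Can run a separate session on each LAG member link (micro-BFD, RFC 7130), not just across the bundle as a whole
✅ Can detect failures in tens to hundreds of milliseconds, depending on the configured interval and multiplier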
✅ Doesn’t rely on periodic LACP PDUs—BFD runs separately to actively confirm if traffic is flowing correctly

BFD + LACP?
Yep! You can run both. LACP ensures the LAG is properly formed, while BFD provides rapid failure detection and ensures the data plane is working as expected.
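For a feel of the numbers, here is a small sketch of how BFD detection time is derived in asynchronous mode, following RFC 5880: the negotiated interval is the larger of what the local end is willing to receive and what the remote end wants to send, multiplied by the remote detect multiplier. The parameter names are illustrative and the values are examples, not recommendations.

```python
def bfd_detection_time_ms(local_required_min_rx_ms,
                          remote_desired_min_tx_ms,
                          remote_detect_mult):
    """Detection time for the local end of a BFD session (asynchronous mode)."""
    # The slower of the two advertised rates wins the negotiation.
    negotiated_interval_ms = max(local_required_min_rx_ms,
                                 remote_desired_min_tx_ms)
    return negotiated_interval_ms * remote_detect_mult


# Both ends ask for 50 ms with a multiplier of 3.
print(bfd_detection_time_ms(50, 50, 3))

# One end can only transmit every 300 ms, so detection slows down.
print(bfd_detection_time_ms(50, 300, 3))
```

The takeaway is that the interval you configure is not the detection time: the multiplier and the slower peer both stretch it.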

🚨 The Hidden Danger | LAG Without Monitoring 🚨

LAG is great—but only if you know when it fails. A broken LAG member won’t always trigger an alert unless you actively monitor it.

Too many engineers deploy LAG once and never check it again. That’s like keeping a spare tyre in your car but never checking whether it’s inflated until the day you need it.

How to Properly Monitor LAG

📊 1. Set Up SNMP Monitoring for LAG Interfaces

Use SNMP polling to check:
✅ LAG status (is the bundle still active?)
✅ Number of active links (are all members up?)
✅ Traffic distribution (is one link overloaded while others sit idle?)
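The three checks above can be sketched as one small function that turns a single poll into a summary. This is only an outline under assumptions: it expects your poller to have already collected each member's ifOperStatus (IF-MIB, where 1 means up) and the growth in its octet counter since the last poll. The dictionary layout and member names are invented for the example, and no SNMP library is involved.

```python
IF_OPER_STATUS_UP = 1  # IF-MIB ifOperStatus value for "up"


def summarise_lag(members):
    """Summarise one poll of a LAG.

    members maps a member name to a dict with:
      oper_status  - ifOperStatus value from the poll
      octets_delta - octets counted since the previous poll
    """
    active = [name for name, m in members.items()
              if m["oper_status"] == IF_OPER_STATUS_UP]
    total_octets = sum(members[name]["octets_delta"] for name in active)
    # Share of the bundle's traffic carried by each active member.
    shares = {
        name: (members[name]["octets_delta"] / total_octets
               if total_octets else 0.0)
        for name in active
    }
    return {
        "bundle_up": bool(active),
        "active_links": len(active),
        "configured_links": len(members),
        "traffic_share": shares,
    }


# Hypothetical poll result for a 4-member bundle.
poll = {
    "eth1": {"oper_status": 1, "octets_delta": 700_000},
    "eth2": {"oper_status": 1, "octets_delta": 200_000},
    "eth3": {"oper_status": 2, "octets_delta": 0},
    "eth4": {"oper_status": 1, "octets_delta": 100_000},
}
summary = summarise_lag(poll)
print(summary)
```

In this made-up poll the bundle is still up, but only 3 of 4 members are active and eth1 is carrying most of the load, which is exactly the kind of thing a simple up/down check on the LAG interface would miss.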

💡 If a LAG member fails and your monitoring doesn’t catch it, your problem isn’t just a network failure—it’s a monitoring failure.

🔔 2. Configure Alerts for LAG Failures

Your NMS (Network Monitoring System) should notify you if:
❌ The LAG goes from 4 links to 3 (or any reduction in members)
❌ Traffic on one member drops to zero
❌ LACP negotiation fails
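Those three conditions could be expressed as a comparison between two consecutive polls. The sketch below is one possible shape for that logic, not a feature of any particular NMS: the snapshot fields (`active_links`, `octets_delta`, `lacp_negotiated`) are hypothetical names for values your monitoring system would need to supply.

```python
def lag_alerts(previous, current):
    """Compare two poll snapshots of one LAG and return alert messages."""
    alerts = []

    # Any reduction in active members, for example 4 links down to 3.
    if current["active_links"] < previous["active_links"]:
        alerts.append(
            f"LAG members reduced from {previous['active_links']} "
            f"to {current['active_links']}"
        )

    # A member that was carrying traffic and now carries none.
    for name, delta in current["octets_delta"].items():
        if delta == 0 and previous["octets_delta"].get(name, 0) > 0:
            alerts.append(f"Traffic on {name} dropped to zero")

    # LACP no longer negotiated with the partner.
    if not current["lacp_negotiated"]:
        alerts.append("LACP negotiation failed")

    return alerts


# Hypothetical snapshots from two consecutive polls.
previous = {
    "active_links": 4,
    "octets_delta": {"eth1": 500, "eth2": 400, "eth3": 450, "eth4": 480},
    "lacp_negotiated": True,
}
current = {
    "active_links": 3,
    "octets_delta": {"eth1": 700, "eth2": 0, "eth3": 600, "eth4": 530},
    "lacp_negotiated": True,
}
alerts = lag_alerts(previous, current)
for alert in alerts:
    print(alert)
```

Comparing against the previous poll, rather than a fixed threshold, means a member that is always idle by design doesn't page anyone, while a member that suddenly goes quiet does.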

📈 3. Periodically Verify LAG Performance

Test failover manually from time to time. If one member failure goes undetected for months and a second link then drops, you’ll be hit by an outage you never saw coming.

🔄 Conclusion: LAG Works—If You Do It Right

LAG is one of the most effective ways to provide resilience at Layer 2. It ensures:
✅ Higher bandwidth by bundling multiple links
✅ Automatic failover if a link drops
✅ Better link utilisation than spanning tree, which leaves redundant links blocked

However, LAG isn’t perfect. Relying on default link status checks is risky. For better failure detection, you should:
✔️ Use LACP to detect misconfigurations
✔️ Deploy BFD for faster failover
✔️ Monitor your LAG with SNMP & alerts

Remember: LAG failure without monitoring = Monitoring failure.

🔹 Wrapping Up | The Classic LAG Horror Story

A large enterprise had a 4-member LAG to their data centre. One link failed silently, but no one noticed because traffic simply shifted to the remaining three links.

Months later, a second link failed—cutting bandwidth in half.

Still, no one noticed.

Then, a third link failed. Suddenly, the company saw huge performance issues, and only then did the IT team investigate.

By then, they were running on a single remaining link—one failure away from a total outage.

Had they monitored their LAG properly, they would’ve fixed the issue after the first failure, instead of waiting for a near-disaster.

So, are you monitoring your LAG?

If not, start now. Otherwise, the next outage might already be in progress. 🚨

More Information:

https://youtu.be/4P9cnoJGl50?si=YcyVbASpouGkrB-y

🔗 Link Aggregation (LAG) | The Backbone of Layer 2 Resilience