Understanding Socket Load Balancing in Cilium: A Deep Dive into How eBPF Is Used

Cheithanya Pr


In Kubernetes, Cilium uses a clever method for load balancing pod-to-service traffic known as the socket-level load balancer. This is part of the KubeProxy Replacement (KPR) initiative.

Load balancing is a critical component in modern Kubernetes clusters, ensuring efficient distribution of network traffic across multiple backend services. Let's explore how this complex system works under the hood.

How It Works

The socket-level load balancer uses BPF programs, specifically BPF_PROG_TYPE_CGROUP_SOCK and BPF_PROG_TYPE_CGROUP_SOCK_ADDR, to intercept socket events such as the connect and sendmsg syscalls. When a socket tries to connect to a service IP, the program rewrites the destination so traffic goes directly to a backend pod. Before diving deeper, it's essential to understand two key components:

Socket Cookies - Unique identifiers assigned to sockets

  • Generated via bpf_get_socket_cookie

  • Used for tracking connections across the load balancer.

Load Balancing Maps (LRU hash maps) - Store forwarding state

  • Track NAT translations

  • Enable efficient lookup and routing
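These two components can be sketched in plain Python (a userspace stand-in, not actual eBPF code; the starting cookie value 12345 is an illustrative assumption matching the examples below):

```python
# Userspace sketch: a socket cookie is just a unique per-socket ID, and
# the load-balancing map keys forwarding state on it.
import itertools

_cookie_counter = itertools.count(12345)  # illustrative starting value

def get_socket_cookie():
    """Stand-in for bpf_get_socket_cookie(): returns a unique socket ID."""
    return next(_cookie_counter)

# Forwarding state keyed by (cookie, backend_ip, backend_port),
# mirroring the reverse NAT map described later in this article.
reverse_nat = {}

cookie = get_socket_cookie()
reverse_nat[(cookie, "20.0.0.2", 80)] = ("100.0.0.1", 80)
```

With the cookie in the key, each socket's NAT state can be looked up and routed independently, even when many sockets talk to the same service.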

Example Scenario

Consider a service with the ClusterIP 100.0.0.1 and backend pods 20.0.0.1 and 20.0.0.2. When a pod executes curl some.service.svc, it connects to 100.0.0.1, but the load balancer redirects the connection to a backend pod, resulting in:

curl some.service.svc -v
*   Trying 20.0.0.2:80...
* Connected to some.service.svc (20.0.0.2) port 80 (#0)

Even though DNS resolves some.service.svc to 100.0.0.1, the connection is made to 20.0.0.2, demonstrating client-side load balancing.
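The redirection above can be sketched as a small userspace simulation. The service map contents and the selection rule (cookie modulo the number of backends) are illustrative assumptions, not Cilium's actual selection algorithm:

```python
# Minimal sketch of client-side load balancing at connect() time.
SERVICE_MAP = {("100.0.0.1", 80): [("20.0.0.1", 80), ("20.0.0.2", 80)]}

def on_connect(cookie, dest_ip, dest_port):
    """Rewrite a service destination to a backend, like the connect hook."""
    backends = SERVICE_MAP.get((dest_ip, dest_port))
    if backends is None:
        return dest_ip, dest_port            # not a service IP: leave untouched
    return backends[cookie % len(backends)]  # pick a backend deterministically

on_connect(12345, "100.0.0.1", 80)  # → ('20.0.0.2', 80)
```

Non-service destinations pass through unchanged, which is why only traffic to ClusterIPs is redirected.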

Handling TCP and UDP

For TCP connections, this redirection happens once during the connect syscall, and the stateful nature of TCP ensures continuous operation. However, UDP can have both connected and unconnected sockets, requiring reverse NAT (revNAT) to match reply traffic with requests.

Reverse NAT and LRU Hashmaps

The revNAT mechanism ensures that UDP traffic appears to come from the service IP. We store reverse NAT information in an LRU (Least Recently Used) hashmap, which evicts the least recently used entries when the map is full.
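The eviction behavior can be mimicked with an ordered dictionary. This mirrors the semantics of a BPF LRU hash map; the tiny capacity of 2 is an illustrative assumption (real maps hold many thousands of entries):

```python
# Toy LRU hashmap: when the map is full, the least recently used entry
# is evicted to make room for the new one.
from collections import OrderedDict

class LruMap:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def update(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)     # refresh recency on update
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

    def lookup(self, key):
        value = self.entries.get(key)
        if value is not None:
            self.entries.move_to_end(key)     # lookups also refresh recency
        return value

m = LruMap(capacity=2)
m.update("a", 1)
m.update("b", 2)
m.lookup("a")     # "a" is now the most recently used entry
m.update("c", 3)  # map is full, so "b" (least recently used) is evicted
```

Because eviction is automatic, even entries that are never explicitly deleted eventually age out, bounding the map's memory footprint.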

Load Balancing Flow

Let's visualize the complete request/response cycle:

sequenceDiagram
    participant CP as Client Pod<br/>(10.0.0.5:50000)
    participant EBPF as eBPF Program
    participant LB as BPF Service Map and LRU hash map
    participant BP as Backend Pod<br/>(20.0.0.2:80)

    Note over CP,BP: Initial Request Phase
    CP->>EBPF: curl echo-service.svc
    Note over EBPF: Create socket cookie: 12345

    EBPF->>LB: Check service map
    LB-->>EBPF: Service exists (100.0.0.1:80)

    EBPF->>EBPF: Select backend pod<br/>20.0.0.2:80

    Note over EBPF,LB: NAT Entry Creation
    EBPF->>LB: Add reverse NAT entry:<br/>Key: {cookie:12345, addr:20.0.0.2, port:80}<br/>Value: {addr:100.0.0.1, port:80}

    Note over CP,BP: Forwarded Request
    CP->>BP: Source: 10.0.0.5:50000<br/>Dest: 20.0.0.2:80

    Note over BP,CP: Response Phase
    BP->>EBPF: Source: 20.0.0.2:80<br/>Dest: 10.0.0.5:50000

    EBPF->>LB: Lookup NAT entry<br/>using cookie 12345

    LB-->>EBPF: Found mapping:<br/>{addr:100.0.0.1, port:80}

    Note over EBPF,CP: Reverse NAT
    EBPF->>CP: Source: 100.0.0.1:80<br/>Dest: 10.0.0.5:50000

Detailed Implementation Steps

Step 1: Pod Initiates Request

When a client pod wants to communicate with a service, it sends a request to the service's ClusterIP (e.g., 100.0.0.1:80).

Packet Values (Outgoing from Pod):

  • Source IP: 10.0.0.5 (client pod)

  • Source Port: 50000 (assigned by the kernel)

  • Destination IP: 100.0.0.1 (service ClusterIP)

  • Destination Port: 80 (service port)

  • Socket: A UDP socket is created with a unique socket cookie, say 12345, obtained via bpf_get_socket_cookie.

Step 2: eBPF Intercepts the Request

The sendmsg syscall (for unconnected UDP) triggers an eBPF program of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR, which intercepts the request and performs several actions:

  • Checks if the destination IP is a service IP.

  • Selects a backend pod (e.g., 20.0.0.2:80) based on load balancing.

  • Rewrites the destination IP/port to the backend pod's IP/port.

  • Adds a reverse NAT entry to cilium_lb4_reverse_sk BPF map to handle reply traffic.
    Key: {cookie: 12345, address: 20.0.0.2, port: 80, pad: 0}

    Value: {address: 100.0.0.1, port: 80, rev_nat_index: 1}

Packet Values (After Rewrite, Sent to Backend):

  • Source IP: 10.0.0.5 (unchanged)

  • Source Port: 50000 (unchanged)

  • Destination IP: 20.0.0.2 (backend pod)

  • Destination Port: 80 (backend port)

Step 3: Backend Responds

The backend pod (20.0.0.2:80) sends a UDP response back to the client pod.

Packet Values (Reply from Backend):

  • Source IP: 20.0.0.2 (backend pod)

  • Source Port: 80 (backend port)

  • Destination IP: 10.0.0.5 (client pod)

  • Destination Port: 50000 (client's port)

Step 4: eBPF Processes the Reply

The recvmsg syscall triggers an eBPF program to process the incoming reply packet.

Actions performed by eBPF:

  • The program looks up cilium_lb4_reverse_sk using the key {cookie: 12345, address: 20.0.0.2, port: 80}. It finds the value {address: 100.0.0.1, port: 80, rev_nat_index: 1} and rewrites the source IP/port to 100.0.0.1:80, ensuring the client sees the response as coming from the service.

Packet Values (Delivered to Pod):

  • Source IP: 100.0.0.1 (service ClusterIP, rewritten)

  • Source Port: 80 (service port, rewritten)

  • Destination IP: 10.0.0.5 (client pod, unchanged)

  • Destination Port: 50000 (client's port, unchanged)
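Steps 1 through 4 can be sketched end to end as a userspace simulation, with plain Python dicts standing in for the BPF maps (this is an illustration of the flow, not actual eBPF code):

```python
# End-to-end sketch for unconnected UDP: the sendmsg hook rewrites the
# destination and records a revNAT entry; the recvmsg hook restores the
# service address on the reply.
reverse_nat = {}  # stands in for the cilium_lb4_reverse_sk map

def sendmsg_hook(cookie, pkt):
    # Step 2: rewrite the service destination to the chosen backend
    if (pkt["dst_ip"], pkt["dst_port"]) == ("100.0.0.1", 80):
        backend = ("20.0.0.2", 80)
        reverse_nat[(cookie, *backend)] = ("100.0.0.1", 80)
        pkt["dst_ip"], pkt["dst_port"] = backend
    return pkt

def recvmsg_hook(cookie, pkt):
    # Step 4: rewrite the backend source back to the service address
    entry = reverse_nat.get((cookie, pkt["src_ip"], pkt["src_port"]))
    if entry is not None:
        pkt["src_ip"], pkt["src_port"] = entry
    return pkt

out = sendmsg_hook(12345, {"src_ip": "10.0.0.5", "src_port": 50000,
                           "dst_ip": "100.0.0.1", "dst_port": 80})
reply = recvmsg_hook(12345, {"src_ip": "20.0.0.2", "src_port": 80,
                             "dst_ip": "10.0.0.5", "dst_port": 50000})
```

The client pod never observes the backend's real address: outgoing packets are rewritten toward 20.0.0.2:80, and the reply's source is restored to 100.0.0.1:80 before delivery.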

The following scenario can arise (hence Step 5, which cleans it up):

Hash Collision Issue: if multiple sockets connect to the same service, they might hash to the same bucket in the reverse NAT map, causing performance degradation. Suppose another socket (cookie 67890) connects to the same service and also hashes to the same bucket in cilium_lb4_reverse_sk (e.g., bucket 5, assuming hash(12345, IP, port) % 10 = 5 and hash(67890, IP, port) % 10 = 5).

Then the map State would look like:

  • Bucket 5 (hash maps are represented as buckets holding a linked list of {key: value} pairs): [{12345, 20.0.0.2, 80} : {100.0.0.1, 80, 1}] --> [{67890, 20.0.0.2, 80} : {100.0.0.1, 80, 1}]

Problem:

  • Lookups for 12345 or 67890 require traversing the linked list in bucket 5, slowing down recvmsg operations.

  • With many stale entries, collisions become frequent, degrading performance.
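The collision cost can be demonstrated with a toy chained hash map. With 5 buckets (an illustrative assumption; real BPF maps hash differently and have far more buckets), both cookies 12345 and 67890 land in the same bucket, so a lookup must walk the chain:

```python
# Toy chained hash map: colliding keys share a bucket, and lookups
# traverse the bucket's chain entry by entry.
NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(cookie, addr, port, value):
    buckets[cookie % NUM_BUCKETS].append(((cookie, addr, port), value))

def lookup(cookie, addr, port):
    steps = 0
    for key, value in buckets[cookie % NUM_BUCKETS]:  # linear chain walk
        steps += 1
        if key == (cookie, addr, port):
            return value, steps
    return None, steps

insert(12345, "20.0.0.2", 80, ("100.0.0.1", 80, 1))
insert(67890, "20.0.0.2", 80, ("100.0.0.1", 80, 1))  # same bucket: collision
```

Here finding cookie 67890 takes two chain steps instead of one; as stale entries accumulate, these chains grow and every recvmsg lookup pays the traversal cost.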

Step 5: Cleanup on Socket Close

When the socket (cookie 12345) closes, the eBPF program attached to cgroup/sock_release retrieves the socket cookie and deletes its entry from the reverse NAT map, cleaning up stale entries.

Updated Map State:

  • Bucket 5: [{67890, 20.0.0.2, 80} : {100.0.0.1, 80, 1}]
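The cleanup step can be sketched as follows (a userspace simulation of the sock_release hook's effect, with a plain dict standing in for the BPF map):

```python
# Sketch of sock_release cleanup: when a socket closes, every reverse
# NAT entry keyed by its cookie is deleted from the map.
reverse_nat = {
    (12345, "20.0.0.2", 80): ("100.0.0.1", 80, 1),
    (67890, "20.0.0.2", 80): ("100.0.0.1", 80, 1),
}

def on_sock_release(cookie):
    """Drop all map entries belonging to the closing socket."""
    for key in [k for k in reverse_nat if k[0] == cookie]:
        del reverse_nat[key]

on_sock_release(12345)  # socket 12345 closed; its entry is removed
```

Prompt deletion keeps the chains short, so later lookups for live sockets like 67890 stay fast.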

Conclusion

Cilium in Kubernetes uses eBPF to implement a socket-level load balancer as part of the KubeProxy Replacement initiative. This approach redirects pod traffic to service endpoints efficiently, using components like socket cookies for connection tracking and LRU hash maps for storing NAT translations. The process involves intercepting socket events, handling TCP/UDP traffic, and managing reverse NAT for UDP. The article details the behind-the-scenes operations of this load balancing mechanism, highlighting potential issues like hash collisions and cleanup strategies to maintain performance.
