Achieving True Zero Downtime with Kubernetes Rolling Updates: Myth or Reality?

Janit Chawla

Introduction to Rolling Updates in Kubernetes:

Kubernetes makes deploying applications straightforward, allowing us to update them with a simple kubectl apply command. But what about downtime when updating the application pods?

This is where deployment strategies come into play, such as blue-green, canary, and rolling updates. By default, Kubernetes uses the rolling update deployment strategy.

Rolling updates promise to update our applications with zero downtime. Or should we say, near-zero downtime?

What is a Rolling Update?

A rolling deployment, also known as a ramped deployment strategy, replaces instances of the old application version with instances of the new application version, one instance at a time.

According to the Kubernetes documentation:

“A rolling update allows a Deployment update to take place with zero downtime. It does this by incrementally replacing the current Pods with new ones. The new Pods are scheduled on Nodes with available resources, and Kubernetes waits for those new Pods to start before removing the old Pods.”

The main objective of the rolling update deployment strategy is to reduce downtime and guarantee that applications stay accessible and operational throughout updates. This strategy is the default option for deployments, and you may be using it without realising it if you have not explicitly chosen another option in your deployment configuration.

Old pods are shut down only after new pods of the new deployment version have been created and become ready to handle traffic.

The concept of Zero Downtime and Rolling Update:

Zero Downtime

Zero downtime means there is no disruption in service availability from the user’s perspective during updates: the end user should not face any issues accessing the application while an update is in progress.

Why is it important?

No business wants to disappoint its customers with service unavailability; even minimal downtime can result in lost revenue and user dissatisfaction.

Moreover, many applications and services operate in environments where service availability is critical. For example, e-commerce platforms, financial services, and communication tools all require high availability to meet user expectations and regulatory requirements.

By achieving zero downtime, businesses can deploy new features, security patches, and performance improvements without affecting the user experience, thereby staying ahead of competitors and continuously delivering value to customers.

Rolling Update

Rolling updates replace pods incrementally, ensuring that some instances of the application are always available, minimising downtime and providing a seamless transition for users.

Gradual Transition:

  • Controlled Rollout: Rolling updates allow for a controlled rollout, updating pods incrementally rather than all at once. This helps in identifying issues early and mitigating risks.

  • Resource Management: By updating one or a few pods at a time, rolling updates make efficient use of resources, avoiding the need to temporarily double resource requirements.

Minimal Resource Overhead:

  • Efficient Use of Resources: Unlike blue/green deployments, rolling updates do not require double the resources. In a blue/green deployment, you maintain two full environments (blue and green), which can be resource-intensive.

  • Cost-Effective: Especially in large-scale deployments, rolling updates can be more cost-effective as they don’t require running two parallel environments.

Simplicity and Integration:

  • Kubernetes Native: Rolling updates are natively supported and well-integrated into Kubernetes. The default deployment strategy for Kubernetes is rolling updates, making it simpler to configure and manage.

How a Rolling Update Works

When there’s an update to an application and we apply those updates, Kubernetes, by default, uses the Rolling Update strategy and starts creating new pods with updated configurations.

To perform a rolling update, we need to define a Deployment object in our Kubernetes cluster, specifying which pods participate in the deployment and the current version of the application.

Here is a simple Deployment example. It runs 10 replicas of nginx using the container image nginx:1.14.2 and employs the RollingUpdate strategy:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nginx
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

Under the rolling update strategy, there are two fields: maxSurge and maxUnavailable.

- maxSurge: Specifies the maximum number of pods that can be created over and above the desired number of replicas during an update. You can specify it as an absolute number or as a percentage of the desired pods. If it is not set, the default is 25%.

- maxUnavailable: Specifies the maximum number of pods that can be unavailable during the rollout. It too can be defined either as an absolute number or as a percentage; if it is not set, the default is also 25%.

At least one of these parameters must be larger than zero.
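
In the Deployment above, maxSurge: 1 and maxUnavailable: 0 mean the rollout never drops below 10 ready pods: Kubernetes creates one new pod (11 in total), waits for it to become ready, terminates one old pod, and repeats until all 10 replicas run the new version. As a sketch, the same strategy block can also be expressed with percentages:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # up to 25% extra pods may be created above the desired count
    maxUnavailable: 25%  # up to 25% of desired pods may be unavailable during the rollout

A rollout is triggered by any change to the pod template, for example kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1, and can be watched with kubectl rollout status deployment/nginx-deployment.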

With this, the rolling update is done. Or is it? Are these configurations enough for zero downtime?

The short answer is NO.

In a rolling update, the switch from the old to the new version is not always smooth. The application might drop new client requests or break in-flight ones if pod termination is not handled gracefully.

Process of Pod Termination

The sequence in which pod termination proceeds:

1. Pod Marked for Termination:

  • The pod is marked for termination, and its status is updated to Terminating.

2. Grace Period Start:

  • Kubernetes waits for a grace period defined by the terminationGracePeriodSeconds parameter (default is 30 seconds). This period allows the application to gracefully handle ongoing requests (see the sketch after this list for where this is configured).

3. PreStop Hook Execution:

  • If the pod has a preStop lifecycle hook, Kubernetes executes this hook. This allows the application to perform any necessary cleanup tasks before the pod is terminated.

4. SIGTERM Signal Sent:

  • Kubernetes sends a SIGTERM signal to the pod’s containers, signaling them to begin a graceful shutdown process. Well-behaved applications should handle this signal and start shutting down gracefully.

5. Endpoint Removal:

  • In parallel with the shutdown steps, Kubernetes removes the pod’s IP address from service endpoints. This involves updating the endpoints in kube-proxy and other components, like ingress controllers, that manage traffic. This step is meant to ensure that new traffic is not routed to the terminating pod, but, as we will see, it races with the SIGTERM path.

6. Traffic Draining:

  • During the grace period, the pod continues to handle ongoing requests but is no longer sent new requests. This allows existing connections to complete.

7. SIGKILL Signal (if necessary):

  • If the pod does not shut down within the grace period, Kubernetes sends a SIGKILL signal to forcefully terminate the containers. This is a last resort to ensure that the pod is eventually stopped.

8. Pod Deletion:

  • Once the containers have stopped, the pod is removed from the cluster.
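
Here is a sketch of where the two knobs from steps 2 and 3 live in a pod spec. The value terminationGracePeriodSeconds: 30 is the Kubernetes default, and the sleep command is an assumed placeholder for real cleanup or connection-draining logic:

spec:
  terminationGracePeriodSeconds: 30        # step 2: time allowed for graceful shutdown (30s is the default)
  containers:
  - name: nginx
    image: nginx:1.14.2
    lifecycle:
      preStop:                             # step 3: runs to completion before SIGTERM is sent
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]  # assumed placeholder for cleanup/draining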

ISSUES!!!

Despite the intention that the pod should not accept new connections during the grace period, the following issues can arise:

1. Slow Endpoint Updates:

  • Delay in Endpoint Updates: There can be delays in updating the list of endpoints across all components (e.g., kube-proxy, ingress controllers). If the components do not receive the updated endpoint information promptly, they might still route traffic to the terminating pod.

2. Ingress Controller Delays:

  • Routing Table Updates: Ingress controllers might not immediately update their routing tables to reflect the removal of the terminating pod. During this delay, new traffic could still be directed to the pod that is supposed to be terminating.

Ideally, Kubernetes should wait for all components in the cluster to have an updated list of endpoints before deleting the pod. However, Kubernetes doesn’t work like this.

So, to make this happen, we need to make a few changes ourselves.

Enhancing Rolling Updates

Readiness Probes:

This step ensures that our application is ready to serve requests before it is exposed to users. Readiness probes check the state of the pods and allow a rolling update to proceed only when all of the containers in a pod are ready. A pod is considered ready, and receives traffic, only when its readiness probe succeeds.
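
A minimal sketch of a readiness probe for the nginx container from the Deployment above (the path and timings are assumptions; point the probe at your application’s real health-check endpoint):

containers:
- name: nginx
  image: nginx:1.14.2
  ports:
  - containerPort: 80
  readinessProbe:
    httpGet:
      path: /               # assumed health endpoint
      port: 80
    initialDelaySeconds: 5  # wait before the first probe
    periodSeconds: 5        # probe every 5 seconds
    failureThreshold: 3     # mark the pod unready after 3 consecutive failures

With this in place, a rolling update using maxUnavailable: 0 will not terminate an old pod until the new pod’s probe succeeds.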

Graceful termination period:

The graceful termination period only helps if the application is able to intercept SIGTERM!

If the app is not coded to intercept SIGTERM, it will simply be hard-killed when the grace period expires, whether or not that period is set to more than the default 30 seconds, which may lead to dropped requests or data loss.
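
A sketch of extending the grace period in the pod template; 60 seconds is an assumed value, sized to cover the longest in-flight request or cleanup task:

spec:
  terminationGracePeriodSeconds: 60  # assumed value: must cover shutdown plus any preStop delay
  containers:
  - name: nginx
    image: nginx:1.14.2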

Proper container handling:

It is important that our containers are configured to handle termination signals correctly. This means ensuring the application shuts down gracefully when it receives the Unix SIGTERM signal.
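
As a sketch, here is a container whose entrypoint traps SIGTERM; the busybox image and the script body are stand-ins for your application and its real shutdown logic:

containers:
- name: app
  image: busybox           # stand-in image; your application goes here
  command: ["/bin/sh", "-c"]
  args:
    - |
      # catch SIGTERM and exit cleanly instead of being hard-killed
      trap 'echo "SIGTERM received, shutting down"; exit 0' TERM
      # stand-in for the application's main loop
      while true; do sleep 1; done

One common pitfall: if the process is launched through a shell without exec, the shell runs as PID 1 and may never forward SIGTERM to the application, so prefer exec-style entrypoints or an explicit trap like the one above.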

Handling SIGTERM is not enough, though: in many cases, even when the application handles SIGTERM, the load balancer still ends up sending traffic to the terminating pod.

PreStop lifecycle hook:

This addresses the issue where pod termination does not wait for load balancers to be reconfigured. The preStop hook runs synchronously, meaning it must finish before the final termination signal (SIGTERM) is sent to the container. We can therefore use the hook to introduce a brief waiting period, giving kube-proxy and ingress controllers time to remove the pod from their endpoint lists before SIGTERM stops the application process.
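
A common sketch is a short sleep in the preStop hook; 15 seconds is an assumed value, tuned to how long endpoint propagation actually takes in your cluster:

containers:
- name: nginx
  image: nginx:1.14.2
  lifecycle:
    preStop:
      exec:
        # keep serving while kube-proxy and ingress controllers drop this pod from their endpoints
        command: ["/bin/sh", "-c", "sleep 15"]

Note that the preStop hook’s duration counts against terminationGracePeriodSeconds, so the grace period must be longer than the sleep plus the application’s own shutdown time.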

Conclusion

While rolling updates are designed to minimise disruptions by incrementally replacing old pods with new ones, achieving true zero downtime requires careful configuration and handling.

Key practices include:

  1. Readiness Probes: Ensure new pods are ready to handle traffic before being marked as available.

  2. Graceful Termination: Configure applications to handle the SIGTERM signal properly and set an adequate termination grace period.

  3. PreStop Lifecycle Hook: Use the PreStop hook to ensure the application completes necessary tasks before shutdown.

Despite Kubernetes’ built-in capabilities, factors such as slow endpoint updates and ingress controller delays can still pose challenges. By understanding and addressing these issues, you can leverage rolling updates to maintain high availability and provide a seamless experience for your users.
