Fixing ALB Unhealthy Targets Caused by OpenSearch Restarts

NaveenKumar VR

🔍 Before We Begin

Before diving into the actual problem and its solution, let’s take a moment to understand what OpenSearch is and where it's commonly used.

📌 What is OpenSearch? Where is it Used?

OpenSearch is an open-source fork of Elasticsearch, and Amazon OpenSearch Service is the AWS-managed offering built on it. It offers powerful capabilities for indexing, searching, and analyzing large volumes of data in near real time.

OpenSearch is widely used in infrastructure for a variety of use cases, including:

  • Log analytics

  • Full-text search

  • Application performance monitoring

  • Security information and event management (SIEM)

In recent years, many companies managing their infrastructure on AWS have been shifting towards AWS-managed services. This transition helps reduce the operational burden of managing infrastructure, allowing IT teams to focus on solving business-critical problems rather than maintaining and scaling services themselves.

Amazon OpenSearch Service has become one of the most widely adopted services in this category, providing a fully managed, scalable, and secure alternative to running Elasticsearch or OpenSearch on self-managed instances.

⚙️ How is OpenSearch Configured?

When using AWS OpenSearch Service, you typically configure the following key components:

🌐 Exposing OpenSearch to External Networks Using ALB

When you create an AWS-managed OpenSearch domain inside a VPC, AWS provides a domain endpoint that is reachable only from within that VPC, subject to your VPC settings and security group rules. Resources or users outside the VPC, such as developers, third-party services, or monitoring tools, cannot access this endpoint directly.

In real-world setups, it's quite common for some users or systems to live outside the VPC. For example, accessing the OpenSearch Dashboard from an external network becomes difficult since direct connectivity isn’t allowed.

To solve this problem, we can expose OpenSearch securely through an AWS Application Load Balancer (ALB).

  1. Create an ALB
    An internet-facing ALB is created to act as a public entry point. This ALB provides a public DNS name that external users and resources can access.

  2. Register OpenSearch Node in a Target Group
    We then create a Target Group and register the IP address of the OpenSearch node behind the domain endpoint.

    🔹 Note: Even if your OpenSearch domain has multiple nodes, AWS internally exposes only one node behind the endpoint. Other nodes are used for redundancy and failover.

  3. Connect Target Group to ALB
    The Target Group is attached to the ALB. This setup ensures that whenever a request hits the ALB, it is forwarded to the registered OpenSearch node in the Target Group.

  4. Route 53 for Domain Management
    Amazon Route 53 maps your public domain name to the ALB's DNS name. If you're using HTTPS, the SSL/TLS certificate issued via ACM (AWS Certificate Manager) for that domain is attached to the ALB's HTTPS listener.

This setup enables external access to OpenSearch in a secure and controlled way. The ALB acts as a bridge between the external network and the private OpenSearch cluster inside the VPC.
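Steps 2 and 3 above could be scripted with boto3 instead of the console. The following is a minimal sketch, not the setup used in this article: the helper names, the target group name, the VPC ID, and the IP are hypothetical placeholders, and the HTTPS protocol on port 443 is an assumption. The helpers only build the request arguments, so they can be inspected without AWS access.

```python
def build_target_group_request(name, vpc_id, port=443):
    """Build keyword arguments for elbv2.create_target_group.

    Assumption: the target group forwards HTTPS traffic to IP targets.
    """
    return {
        "Name": name,
        "Protocol": "HTTPS",
        "Port": port,
        "VpcId": vpc_id,
        "TargetType": "ip",  # targets are registered by IP address
    }


def build_register_request(target_group_arn, node_ip, port=443):
    """Build keyword arguments for elbv2.register_targets."""
    return {
        "TargetGroupArn": target_group_arn,
        "Targets": [{"Id": node_ip, "Port": port}],
    }


# Hypothetical usage (requires AWS credentials, so it is left as a comment):
#   import boto3
#   elbv2 = boto3.client("elbv2")
#   tg = elbv2.create_target_group(
#       **build_target_group_request("opensearch-tg", "vpc-0123456789abcdef0"))
#   arn = tg["TargetGroups"][0]["TargetGroupArn"]
#   elbv2.register_targets(**build_register_request(arn, "10.0.1.25"))
```

Keeping the request construction separate from the API call makes it easy to review exactly what will be sent before anything is created.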

⚠️ The Problem Statement: Why the ALB Connection Breaks

Here comes the real challenge.

In most setups, we do not maintain static (sticky) IP addresses for OpenSearch cluster nodes. When an OpenSearch cluster restarts—either due to scaling, upgrades, or internal AWS maintenance—the IP address of the node exposed via the endpoint can change.

Now, since the ALB target group is manually registered with the IP of the previously exposed node, it still tries to forward traffic to that old, now-invalid IP. But OpenSearch is now responding from a new node with a different IP. And here's the problem:

  • The ALB doesn’t have any native integration with OpenSearch to update its target group dynamically.

  • As a result, the target group points to a stale IP, and the target becomes unhealthy, breaking the connection between the ALB and OpenSearch.

This leads to downtime or failed requests from all external clients relying on the ALB for access.
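To see the symptom for yourself, you can inspect the target group's health. The sketch below filters a response shaped like the output of the elbv2 `describe_target_health` call and lists every target that is not healthy. The function name and the sample data are made up for illustration; the IP and reason code are not taken from a real incident.

```python
def unhealthy_targets(response):
    """Return (ip, state, reason) for every target that is not healthy.

    `response` is expected to have the shape returned by
    elbv2.describe_target_health.
    """
    result = []
    for desc in response.get("TargetHealthDescriptions", []):
        health = desc.get("TargetHealth", {})
        if health.get("State") != "healthy":
            result.append(
                (desc["Target"]["Id"], health.get("State"), health.get("Reason"))
            )
    return result


# Hand-written sample (illustrative only) mimicking a stale registration
sample = {
    "TargetHealthDescriptions": [
        {
            "Target": {"Id": "10.0.1.25", "Port": 443},
            "TargetHealth": {"State": "unhealthy", "Reason": "Target.Timeout"},
        }
    ]
}
print(unhealthy_targets(sample))
```

In a live environment you would pass in the result of `boto3.client("elbv2").describe_target_health(TargetGroupArn=...)` instead of the sample dictionary.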

🔁 The Need for Automation

To prevent this, we need a robust, automated mechanism to keep the ALB target group in sync with the currently exposed IP of the OpenSearch endpoint.

There are several ways to implement this, and in the next section, I’ll walk you through one of the approaches I recently implemented in a production environment.

✅ Quick Summary: How I Solved the ALB & OpenSearch Restart Issue

Before we dive into the detailed steps, here’s a quick overview of how this issue was resolved.

Assumption: You already have an OpenSearch cluster running and exposed via an ALB to allow access from external networks.

The core idea is to detect when the OpenSearch cluster restarts and automatically update the ALB target group with the currently active OpenSearch node IP. This ensures that the ALB always routes traffic to a healthy target, avoiding downtime or broken access for external users.

To achieve this:

  • I set up a CloudWatch alarm that monitors the number of nodes in the OpenSearch cluster.

  • Using EventBridge, I track when the alarm transitions from an ALARM state back to OK, indicating that the cluster has recovered after a restart.

  • This triggers a Lambda function, which:

    • Fetches the active OpenSearch node IP.

    • Updates the ALB target group with the new IP.

    • Removes any stale IPs no longer part of the cluster.

This automation ensures that your ALB always points to a healthy OpenSearch node, even after a restart — maintaining uninterrupted access for external systems.
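The three Lambda actions listed above boil down to a small set comparison. Here is a minimal sketch of that decision logic, kept free of AWS calls so it can be tested locally; the function name `plan_target_sync` and the example IPs are hypothetical.

```python
def plan_target_sync(active_ip, registered_ips):
    """Decide which IPs to register and deregister.

    Returns (to_register, to_deregister) so that, once both lists are
    applied, only active_ip remains in the target group.
    """
    to_register = [] if active_ip in registered_ips else [active_ip]
    to_deregister = [ip for ip in registered_ips if ip != active_ip]
    return to_register, to_deregister


# After a restart the endpoint resolves to a new IP
print(plan_target_sync("10.0.2.40", ["10.0.1.25"]))

# Nothing changed, so there is nothing to do
print(plan_target_sync("10.0.1.25", ["10.0.1.25"]))
```

Because the function returns empty lists when the target group is already in sync, running the automation more often than strictly necessary is harmless.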

In the next section, I’ll walk you through each step of this setup with screenshots and configuration details.

🛠️ Detailed Procedure:

The steps outlined below can be implemented using any Infrastructure-as-Code (IaC) or configuration management tool of your choice. In this guide, the focus is on what needs to be implemented rather than how it's implemented with a specific tool. You’re free to use the tool or framework that best fits your environment — this guide should help you understand the logic and flow regardless of the platform.

🔹 Step 1: Get the Total Number of Nodes in Your OpenSearch Cluster

The first step is to identify the total number of nodes (both master and data nodes) in your OpenSearch cluster. This information is essential for setting up a reliable CloudWatch alarm.

You can find this detail using Amazon CloudWatch:

  1. Log in to the AWS Management Console.

  2. In the search bar, type and select CloudWatch.

  3. From the left-hand menu, click on "All metrics."

  4. In the search bar, type “ES” to filter OpenSearch-related metrics. From the results, select “ES → Per-Domain, Per-Client Metrics.”

  5. Choose your OpenSearch domain and look for the Nodes metric. This will show the number of nodes currently active in your cluster.

  6. Check the checkbox for Nodes corresponding to your specific OpenSearch domain. This will display a graph showing the number of active nodes over time.

  7. While there are certainly easier ways to retrieve the total number of nodes in an OpenSearch cluster, this approach has a key advantage — it allows us to directly create a CloudWatch alarm based on this metric.

  8. So, as you might have guessed — the next step is to click on “Create alarm.”

  9. Configure the alarm settings as follows.

  10. In the next step, click “Remove” under the notification section to delete any default SNS notification settings — since we’ll be using EventBridge to trigger the action instead of relying on SNS alerts.

  11. Give the alarm a name of your choice.

  12. Finally, click Create alarm.

  13. Now, follow Steps 3 to 5 in the alarm creation flow. Once completed, you should see 1 alarm listed under your OpenSearch domain name in the CloudWatch metrics dashboard.

  14. Select the Nodes checkbox, and then open the Graphed metrics tab.

  15. Select Add Math » Conditional » Equals

  16. Click the Edit icon next to the expression.

  17. Change the expression to m1 == <your total number of nodes>, where m1 is the metric ID for Nodes in your alarm (shown in the same menu under ID) and <your total number of nodes> is the node count you noted earlier from the Nodes metric. Click Apply.

  18. Voila! The alarm is now set. It will return a value of 1 when the current node count is exactly equal to the expected total (in our case, 6), and 0 when it’s either greater or less than that.

  19. This is exactly what we need — because any change in node count usually indicates that something has happened with the OpenSearch cluster (such as a restart or scaling event). At this point, we need to verify the ALB Target Group IP and update it if it no longer points to a healthy OpenSearch node.

  20. Now, let’s move on to setting up EventBridge to trigger a Lambda function whenever this happens.
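If you would rather create the alarm in code, the same metric math expression can be passed to CloudWatch's `put_metric_alarm` API. The sketch below only builds the request arguments. The period, statistic, threshold, and evaluation period are assumptions rather than the values from the console walkthrough, and the function name, alarm name, domain name, and account ID are placeholders.

```python
def build_node_count_alarm(alarm_name, domain_name, account_id, expected_nodes):
    """Build keyword arguments for cloudwatch.put_metric_alarm.

    The expression returns 1 while the node count equals expected_nodes and
    0 otherwise, so the alarm fires when the expression drops below 1.
    """
    return {
        "AlarmName": alarm_name,
        "EvaluationPeriods": 1,                     # assumption
        "ComparisonOperator": "LessThanThreshold",  # assumption: alarm when e1 < 1
        "Threshold": 1.0,
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": False,  # input to the expression only
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ES",
                        "MetricName": "Nodes",
                        "Dimensions": [
                            {"Name": "DomainName", "Value": domain_name},
                            {"Name": "ClientId", "Value": account_id},
                        ],
                    },
                    "Period": 60,       # assumption
                    "Stat": "Minimum",  # assumption
                },
            },
            {
                "Id": "e1",
                "Expression": f"m1 == {expected_nodes}",
                "Label": "NodeCountMatches",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
    }


# Hypothetical usage (requires AWS credentials, so it is left as a comment):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **build_node_count_alarm("opensearch-node-count", "my-domain", "123456789012", 6))
```

Exactly one entry in `Metrics` has `ReturnData` set to `True`, which tells CloudWatch which series the alarm should watch.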

🔹 Step 2: Configure a Lambda Function to Update the Target Group with the OpenSearch Endpoint Node IP

There are plenty of resources — including official documentation and YouTube tutorials — that can guide you through how to create and configure a Lambda function.

In this section, I’ll focus on what matters most for this use case:

  • The runtime/language you should select for your Lambda: Python 3.9 or later.

  • The script you’ll use to update the Target Group.

  • The environment variables that need to be configured for the Lambda to work effectively.

    • OPENSEARCH_HOST

      • To find this value: log in to the console » search for OpenSearch » click Domains » select your domain. The value is listed under VPC endpoint.

      • (Remove “https://” when adding it to the environment variable.)

    • TARGET_GROUP_ARN

      • To find this value: log in to the console » search for Target Groups » select your target group. The value is listed under ARN.

  • Script to use

    Note: The port is hard-coded to 443; replace it with your port number.

import os
import socket

import boto3

elb = boto3.client('elbv2')

TARGET_GROUP_ARN = os.environ['TARGET_GROUP_ARN']
OPENSEARCH_HOST = os.environ['OPENSEARCH_HOST']
PORT = 443  # Hard-coded port; replace with your OpenSearch port


def lambda_handler(event, context):
    try:
        # Resolve the IP currently behind the OpenSearch endpoint
        new_ip = socket.gethostbyname(OPENSEARCH_HOST)
        print(f"Resolved OpenSearch IP: {new_ip}")

        # Describe current targets
        current_targets = elb.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
        registered = [t['Target'] for t in current_targets['TargetHealthDescriptions']]
        registered_ips = [t['Id'] for t in registered]
        print(f"Currently registered IPs: {registered_ips}")

        # Deregister old IPs, using the port each target was registered with
        for target in registered:
            if target['Id'] != new_ip:
                print(f"Deregistering IP: {target['Id']}")
                elb.deregister_targets(
                    TargetGroupArn=TARGET_GROUP_ARN,
                    Targets=[{"Id": target['Id'], "Port": target['Port']}]
                )

        # Register new IP if not already registered
        if new_ip not in registered_ips:
            print(f"Registering new IP: {new_ip}")
            elb.register_targets(
                TargetGroupArn=TARGET_GROUP_ARN,
                Targets=[{"Id": new_ip, "Port": PORT}]
            )

        return {"status": "Success", "new_ip": new_ip}

    except Exception as e:
        print(f"Error: {e}")
        raise  # Re-raise so the invocation is reported as failed
  • Make sure the Lambda function's execution role has the necessary permissions. If it doesn't, create an IAM role with the access below.

    • Read the VPC endpoint configuration settings from OpenSearch

    • Ability to read the targets registered in the target group (the script calls describe_target_health)

    • Ability to add targets to the target group

    • Ability to delete/drain targets from the target group

  • Once everything is set, test it by manually invoking the Lambda function.
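For reference, the target group permissions above could be expressed as an IAM policy document along these lines. This is a sketch that covers only the three Elastic Load Balancing calls the script makes; it does not include the OpenSearch read access mentioned in the list, and your function will also need whatever logging and networking permissions your Lambda setup requires. The function name and the ARN placeholder are hypothetical.

```python
import json


def build_lambda_policy(target_group_arn):
    """Build an IAM policy document for the target group sync Lambda."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Register and deregister targets, scoped to one target group
                "Effect": "Allow",
                "Action": [
                    "elasticloadbalancing:RegisterTargets",
                    "elasticloadbalancing:DeregisterTargets",
                ],
                "Resource": target_group_arn,
            },
            {
                # Describe calls are granted on all resources here
                "Effect": "Allow",
                "Action": "elasticloadbalancing:DescribeTargetHealth",
                "Resource": "*",
            },
        ],
    }


# Placeholder ARN; substitute the value of TARGET_GROUP_ARN
policy = build_lambda_policy(
    "arn:aws:elasticloadbalancing:REGION:ACCOUNT_ID:targetgroup/NAME/ID"
)
print(json.dumps(policy, indent=2))
```

Scoping the write actions to a single target group ARN keeps the function from modifying any other load balancer targets in the account.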

🔹 Step 3: Configure EventBridge to Trigger the Lambda Function

  1. Log in to the AWS Management Console.

  2. In the search bar, type and select Amazon EventBridge. From the left menu, click Rules, then select Create rule.

  3. Enter a rule name and description, then click Next.

  4. Click Edit pattern.

  5. Paste the event pattern below, replacing the placeholder with your own value.

    - <YOUR ALARM NAME> » Replace this with the name of the alarm you created in Step 1

{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "alarmName": ["<YOUR ALARM NAME>"],
    "state": {
      "value": ["ALARM", "OK"]
    },
    "previousState": {
      "value": ["OK", "ALARM"]
    }
  }
}
  6. Explanation of the condition

    1. "source": ["aws.cloudwatch"]
      Ensures the event is coming from Amazon CloudWatch.

    2. "detail-type": ["CloudWatch Alarm State Change"]
      Captures only events related to alarm state changes.

    3. "alarmName": ["<YOUR ALARM NAME>"]
      Filters the event to trigger only for the specific alarm you created to monitor OpenSearch node count.

    4. "state": { "value": ["ALARM", "OK"] }
      Matches events in which the alarm's new state is either ALARM or OK.

    5. "previousState": { "value": ["OK", "ALARM"] }
      Matches events in which the alarm's previous state was OK or ALARM. Together with the state filter, this means the rule fires on OK ➝ ALARM and ALARM ➝ OK transitions and ignores transitions that involve INSUFFICIENT_DATA, avoiding unnecessary triggers.

  7. Click Next to save your pattern.

  8. Then, for the target:

    1. Target type: AWS service

    2. Select a target » Lambda function

    3. Function » Select the function you created in Step 2 from the drop-down

    4. Leave the other fields as they are and click Next

  9. In the next step, add any tags you need and click Next. Finally, review the configuration and click Create rule.
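To check which alarm transitions the pattern lets through, here is a small local sketch. The `matches` function is a simplified stand-in for EventBridge's matching (exact-value lists and nested objects only), not the real matching engine, and the `alarm_event` helper builds a trimmed-down event with only the fields the pattern inspects. Both names are hypothetical.

```python
PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["<YOUR ALARM NAME>"],
        "state": {"value": ["ALARM", "OK"]},
        "previousState": {"value": ["OK", "ALARM"]},
    },
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must exist in the event, and each
    leaf value must appear in the pattern's list of allowed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def alarm_event(name, previous, current):
    """Build a trimmed-down alarm state change event for local testing."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {
            "alarmName": name,
            "state": {"value": current},
            "previousState": {"value": previous},
        },
    }


for previous, current in [("OK", "ALARM"), ("ALARM", "OK"), ("INSUFFICIENT_DATA", "OK")]:
    event = alarm_event("<YOUR ALARM NAME>", previous, current)
    print(previous, "->", current, matches(PATTERN, event))
```

The Lambda therefore runs both when the node count first deviates and when it recovers. Since the function only changes the target group when the resolved IP differs from the registered one, the extra invocation does no harm.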

Voila! This concludes the overall configuration for automating the update of OpenSearch Node IPs in the Target Group.

With this setup in place, your Target Group will always stay up to date with the active and healthy IP address of your OpenSearch cluster — ensuring seamless connectivity even during restarts or node changes.
