Kubernetes Learning Week Series 18

Nan Song

Kubernetes Learning Week Series 17


Debugging a Python Slack Bot Hang in Production

[Context]

Our company runs a Slack bot implemented in Python. The bot monitors alert notifications sent to designated Slack channels. Upon receiving an alert (triggered by Prometheus Alertmanager), it parses the message content and automatically queries the Loki log server for relevant error logs or stack traces. The bot then posts Loki's response as a threaded reply in the same Slack channel, allowing developers to quickly assess urgency and prioritize actions without leaving their workflow.
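
At a high level, the flow looks roughly like the sketch below, assuming Slack Bolt running in Socket Mode. The handler name mirrors the handle_message_events seen later in the py-spy dump, but the alert parsing, the token environment variables, and the get_logs_from_loki() helper are illustrative placeholders rather than the actual implementation.

    # Minimal sketch of the bot's flow (Slack Bolt in Socket Mode assumed).
    # The alert parsing and get_logs_from_loki() are placeholders.
    import os
    import re

    from slack_bolt import App
    from slack_bolt.adapter.socket_mode import SocketModeHandler

    app = App(token=os.environ["SLACK_BOT_TOKEN"])

    def get_logs_from_loki(alert_name: str) -> str:
        """Placeholder: the real bot queries Loki's query_range API here."""
        return f"(error logs for {alert_name} would be fetched from Loki)"

    @app.event("message")
    def handle_message_events(event, say):
        text = event.get("text", "")
        match = re.search(r"alertname[=:]\s*(\S+)", text)  # simplified Alertmanager parsing
        if not match:
            return
        logs = get_logs_from_loki(match.group(1))
        say(text=logs, thread_ts=event["ts"])  # reply in the same thread as the alert

    if __name__ == "__main__":
        SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()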

[Issue Symptoms]

The bot had been running for a long time, but earlier this month we noticed that it stopped working as expected after a couple of hours and had to be restarted frequently, so my colleague and I set out to identify the root cause.

[Journey to identifying the root cause]

  1. Found that the code making Loki queries was missing timeout and retry logic, as well as any checking of the Loki response type (a usage sketch of the fixed function appears after step 6 below).

     # Code Snippet
    
     import httpx
     import logging
     import time
    
     def make_loki_query(request):
         timeout = httpx.Timeout(10.0, connect=5.0, read=30.0)  # httpx requires a default (or all four) timeouts
         transport = httpx.HTTPTransport(retries=3)
    
         with httpx.Client(timeout=timeout, transport=transport) as client:
             for attempt in range(3):
                 try:
                     response = client.send(request)
    
                     response.raise_for_status()
    
                     try:
                         return response.json()
                     except ValueError as json_err:
                         logging.error(f"❌ Failed to parse Loki JSON: {json_err}, raw response: {response.text}")
                         return None
    
                 except httpx.RequestError as e:
                     logging.error(f"🌐 Network error on attempt {attempt + 1}: {e}")
                     time.sleep(2 ** attempt)  # Exponential backoff
                 except httpx.HTTPStatusError as e:
                     logging.error(f"🚨 HTTP error: {e.response.status_code} - {e.response.text}")
                     break
                 except Exception as e:
                     logging.error(f"💥 Unexpected error: {e}")
                     break
    
             logging.error("All retries failed when querying Loki.")
             return None
    
  2. But after applying these changes, the situation did not improve.

  3. We then tried to add the py-spy tracing tool, but we needed to figure out how to integrate it into the deployment manifest (py-spy needs the SYS_PTRACE capability, which is added in the securityContext below).

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       ......
     spec:
       progressDeadlineSeconds: 600
       replicas: 1
       revisionHistoryLimit: 10
       selector:
         matchLabels:
           app: fp-bot
       strategy:
         rollingUpdate:
           maxSurge: 25%
           maxUnavailable: 25%
         type: RollingUpdate
       template:
         metadata:
           annotations:
             kubectl.kubernetes.io/restartedAt: "2025-04-03T17:44:21Z"
           creationTimestamp: null
           labels:
             app: fp-bot
         spec:
           containers:
           - env:
             ... ...
             image: <......>/cloud/fp-bot:temp-fix04031515
             imagePullPolicy: IfNotPresent
             name: fp-bot
             resources: {}
             securityContext:
               allowPrivilegeEscalation: true
               capabilities:
                 add:
                 - SYS_PTRACE
             terminationMessagePath: /dev/termination-log
             terminationMessagePolicy: File
             workingDir: /usr/src/app
    
  4. After the pod restarted, we used the following commands to diagnose it:

     kubectl exec -it <fp-bot-pod> -n slack-bot -- bash
    
     ps -ef | grep 'python app.py'
    
     py-spy dump --pid <fp-bot pid>
    
  5. We found that many threads were stalled like this:

     Thread 40 (idle): "ThreadPoolExecutor-0_2"
         read (ssl.py:931)
         recv (ssl.py:1056)
         read (httpcore/backends/sync.py:28)
         _receive_event (httpcore/_sync/http11.py:192)
         _receive_response_headers (httpcore/_sync/http11.py:155)
         handle_request (httpcore/_sync/http11.py:91)
         handle_request (httpcore/_sync/connection.py:90)
         handle_request (httpcore/_sync/connection_pool.py:237)
         handle_request (httpx/_transports/default.py:218)
         _send_single_request (httpx/_client.py:1009)
         _send_handling_redirects (httpx/_client.py:973)
         _send_handling_auth (httpx/_client.py:939)
         send (httpx/_client.py:912)
         make_loki_query (lokiapiclient.py:63)
         getLogFromLoki (app.py:237)
         __retry_internal (retry/api.py:33)
         retry_decorator (retry/api.py:74)
         fun (decorator.py:232)
         dealQpsEvent (app.py:71)
         __retry_internal (retry/api.py:33)
         retry_decorator (retry/api.py:74)
         fun (decorator.py:232)
         handle_message_events (app.py:281)
         run_ack_function (slack_bolt/listener/custom_listener.py:55)
         run_ack_function_asynchronously (slack_bolt/listener/thread_runner.py:124)
         run (concurrent/futures/thread.py:57)
         _worker (concurrent/futures/thread.py:80)
         run (threading.py:870)
         _bootstrap_inner (threading.py:926)
         _bootstrap (threading.py:890)
    
  6. Based on these observations, we suspected that our application was hitting Loki's query limits. Upon reviewing the Loki server configuration, we confirmed that the maximum query parallelism was set to 32:

limits_config:
  max_query_parallelism: 32
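
For reference, a hypothetical call site for the make_loki_query() function from step 1 might look like this. The Loki URL and the LogQL query are placeholders, and the response handling assumes Loki's standard query_range JSON shape (data.result[].values).

    import httpx

    # Hypothetical call site for make_loki_query() from step 1.
    LOKI_URL = "http://loki-gateway:3100/loki/api/v1/query_range"  # placeholder address

    request = httpx.Request(
        "GET",
        LOKI_URL,
        params={
            "query": '{app="fp-bot"} |= "ERROR"',  # placeholder LogQL query
            "limit": 100,
            "direction": "backward",
        },
    )

    result = make_loki_query(request)
    if result is not None:
        for stream in result.get("data", {}).get("result", []):
            print(stream["stream"], "->", len(stream["values"]), "log lines")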

[Solutions]

  1. Increase max_query_parallelism from 32 to 64 or more; however, this will put more pressure on the Loki server, so we might also need to increase its CPU and memory.

  2. Add client-side rate limiting (a working queue); a minimal sketch follows below.
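
As a rough sketch of option 2 (using a bounded semaphore rather than an explicit queue, but with the same effect), the bot could cap in-flight Loki queries well below max_query_parallelism so that excess requests wait on the client side instead of piling up inside Loki. The limit of 16 and the 60-second wait are assumptions.

    import logging
    import threading

    LOKI_CONCURRENCY_LIMIT = 16  # assumption: stay well under max_query_parallelism (32)
    _loki_slots = threading.BoundedSemaphore(LOKI_CONCURRENCY_LIMIT)

    def make_loki_query_limited(request, wait_timeout=60.0):
        """Wrap make_loki_query() from step 1 with a client-side concurrency cap."""
        # Wait up to wait_timeout seconds for a free slot instead of blocking forever.
        if not _loki_slots.acquire(timeout=wait_timeout):
            logging.error("Timed out waiting for a free Loki query slot")
            return None
        try:
            return make_loki_query(request)
        finally:
            _loki_slots.release()

Combined with the read timeout from step 1, this keeps stuck Loki calls from exhausting the Slack Bolt thread pool.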

