Kubernetes Learning Week Series 18


Kubernetes Learning Week Series 17
Debugging a Python Slack Bot Hang in Production
[Context]
Our company runs a Slack bot written in Python. The bot monitors alert notifications sent to designated Slack channels. When an alert arrives (triggered by Prometheus Alertmanager), it parses the message content and automatically queries the Loki log server for the relevant error logs or stack traces. The bot then posts Loki's response as a threaded reply in the same Slack channel, allowing developers to quickly assess urgency and prioritize actions without leaving their workflow.
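For context, the overall flow looks roughly like the sketch below. The handler and helper names (`parse_alert`, `query_loki_for_logs`), the `app=` label format, the `LOKI_URL` endpoint, and the use of Socket Mode are illustrative assumptions rather than our actual code; the real bot follows the same shape, though: a Slack Bolt message listener that parses the alert and posts Loki's answer as a threaded reply.

```python
# Minimal sketch of the bot's flow -- handler/helper names and the Loki query
# are illustrative assumptions, not the production implementation.
import os
import time

import httpx
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

LOKI_URL = os.environ.get("LOKI_URL", "http://loki:3100")  # assumed Loki endpoint

app = App(token=os.environ["SLACK_BOT_TOKEN"])


def parse_alert(text: str):
    """Pull an app label out of the Alertmanager message (hypothetical format)."""
    for token in text.split():
        if token.startswith("app="):
            return token.split("=", 1)[1]
    return None


def query_loki_for_logs(app_label: str) -> str:
    """Ask Loki's query_range API for recent error lines of the given app."""
    now = time.time_ns()
    params = {
        "query": f'{{app="{app_label}"}} |= "error"',
        "start": now - 15 * 60 * 10**9,  # last 15 minutes, in nanoseconds
        "end": now,
        "limit": 20,
    }
    resp = httpx.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params, timeout=30.0)
    resp.raise_for_status()
    streams = resp.json()["data"]["result"]
    lines = [line for stream in streams for _, line in stream["values"]]
    return "\n".join(lines) or "No matching logs found in Loki."


@app.event("message")
def handle_message_events(event, say):
    app_label = parse_alert(event.get("text", ""))
    if app_label is None:
        return  # not an Alertmanager notification we recognize
    say(text=query_loki_for_logs(app_label), thread_ts=event["ts"])  # threaded reply


if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```

One detail that matters later: Slack Bolt runs each listener in a ThreadPoolExecutor worker thread, which is exactly where the py-spy dump below will show the bot getting stuck.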
[Issue Symptoms]
This Slack bot had been running for a long time, but earlier this month we noticed that it stopped working as expected after a couple of hours and had to be restarted frequently, so my colleague and I set out to identify the root cause.
[Journey to Identify the Root Cause]
First, we found that the Loki query code was missing a timeout, retry logic, and a check on the type of Loki's response, so we added them:
```python
# Code Snippet
import httpx
import logging
import time


def make_loki_query(request):
    # httpx.Timeout requires either a default or all four values set explicitly,
    # so a 10s default is supplied alongside the connect/read overrides.
    timeout = httpx.Timeout(10.0, connect=5.0, read=30.0)
    transport = httpx.HTTPTransport(retries=3)
    with httpx.Client(timeout=timeout, transport=transport) as client:
        for attempt in range(3):
            try:
                response = client.send(request)
                response.raise_for_status()
                try:
                    return response.json()
                except ValueError as json_err:
                    logging.error(f"❌ Failed to parse Loki JSON: {json_err}, raw response: {response.text}")
                    return None
            except httpx.RequestError as e:
                logging.error(f"🌐 Network error on attempt {attempt + 1}: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff
            except httpx.HTTPStatusError as e:
                logging.error(f"🚨 HTTP error: {e.response.status_code} - {e.response.text}")
                break
            except Exception as e:
                logging.error(f"💥 Unexpected error: {e}")
                break
        logging.error("❌ All retries failed when querying Loki.")
        return None
```
However, after applying these changes, the situation did not improve.
We then decided to attach the py-spy profiling tool to the running process, but first we needed to figure out how to integrate it into the Deployment manifest, since py-spy needs the SYS_PTRACE capability:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  ......
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: fp-bot
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2025-04-03T17:44:21Z"
      creationTimestamp: null
      labels:
        app: fp-bot
    spec:
      containers:
        - env:
            ...
            ...
          image: <......>/cloud/fp-bot:temp-fix04031515
          imagePullPolicy: IfNotPresent
          name: fp-bot
          resources: {}
          securityContext:
            allowPrivilegeEscalation: true
            capabilities:
              add:
                - SYS_PTRACE
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          workingDir: /usr/src/app
```
After the pod restarted, we diagnosed it like this:
```bash
kubectl exec -it <fp-bot-pod> -n slack-bot -- bash
ps -ef | grep 'python app.py'
py-spy dump --pid <fp-bot pid>
```
We found that many threads were stalled like this:
```
Thread 40 (idle): "ThreadPoolExecutor-0_2"
    read (ssl.py:931)
    recv (ssl.py:1056)
    read (httpcore/backends/sync.py:28)
    _receive_event (httpcore/_sync/http11.py:192)
    _receive_response_headers (httpcore/_sync/http11.py:155)
    handle_request (httpcore/_sync/http11.py:91)
    handle_request (httpcore/_sync/connection.py:90)
    handle_request (httpcore/_sync/connection_pool.py:237)
    handle_request (httpx/_transports/default.py:218)
    _send_single_request (httpx/_client.py:1009)
    _send_handling_redirects (httpx/_client.py:973)
    _send_handling_auth (httpx/_client.py:939)
    send (httpx/_client.py:912)
    make_loki_query (lokiapiclient.py:63)
    getLogFromLoki (app.py:237)
    __retry_internal (retry/api.py:33)
    retry_decorator (retry/api.py:74)
    fun (decorator.py:232)
    dealQpsEvent (app.py:71)
    __retry_internal (retry/api.py:33)
    retry_decorator (retry/api.py:74)
    fun (decorator.py:232)
    handle_message_events (app.py:281)
    run_ack_function (slack_bolt/listener/custom_listener.py:55)
    run_ack_function_asynchronously (slack_bolt/listener/thread_runner.py:124)
    run (concurrent/futures/thread.py:57)
    _worker (concurrent/futures/thread.py:80)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
```
Based on these observations, we suspected that our application was hitting Loki's query concurrency limit: once enough of the bot's listener threads were blocked waiting for Loki responses, it could no longer handle new Slack events, which matched the symptom of the bot going silent until restarted. Upon reviewing the Loki server configuration, we confirmed that the maximum query parallelism was set to 32:
```yaml
limits_config:
  max_query_parallelism: 32
```
[Solutions]
Increase max_query_parallelism from 32 to 64 or more. However, this puts extra pressure on the Loki server, so we may also need to give it more CPU and memory.
Add client-side rate limiting in the bot (a worker queue), so that it never has more Loki queries in flight than the server can serve; see the sketch below.
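As a rough illustration of the second option, a bounded semaphore can cap the number of in-flight Loki queries. This is a minimal sketch, assuming the make_loki_query wrapper from the earlier snippet; the limit value and names are placeholders, not our final implementation.

```python
import logging
import threading

# Cap in-flight Loki queries well below the server's max_query_parallelism
# (placeholder value, to be tuned against the Loki configuration).
MAX_CONCURRENT_LOKI_QUERIES = 16
_loki_slots = threading.BoundedSemaphore(MAX_CONCURRENT_LOKI_QUERIES)


def make_loki_query_limited(request, wait_seconds: float = 60.0):
    """Run make_loki_query only when a concurrency slot is free.

    If no slot frees up within wait_seconds, give up instead of blocking
    the Slack listener thread indefinitely.
    """
    if not _loki_slots.acquire(timeout=wait_seconds):
        logging.error("⏳ Gave up waiting for a free Loki query slot.")
        return None
    try:
        return make_loki_query(request)  # the retry/timeout wrapper shown earlier
    finally:
        _loki_slots.release()
```

Alternatively, queries can be pushed onto a queue.Queue consumed by a fixed number of worker threads; either way, the bot stops piling more concurrent requests onto Loki than the server allows.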
