Kubernetes Learning Week Series 18


Kubernetes Learning Week Series 17
Debugging a Python Slack Bot Hang in Production
[Context]
Our company runs a Slack bot written in Python. The bot monitors alert notifications sent to designated Slack channels. When an alert arrives (triggered by Prometheus Alertmanager), it parses the message content and automatically queries the Loki log server for the relevant error logs or stack traces. The bot then posts Loki's response as a threaded reply in the same Slack channel, allowing developers to quickly assess urgency and prioritize actions without leaving their workflow.
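For context, the overall flow looks roughly like the sketch below. The handler and helper names (`parse_alert`, `query_loki_for_logs`), the `app=` label format, the `LOKI_URL` endpoint, and the use of Socket Mode are illustrative assumptions rather than our actual code; the real bot follows the same shape, though: a Slack Bolt message listener that parses the alert and posts Loki's answer as a threaded reply.

```python
# Minimal sketch of the bot's flow -- handler/helper names and the Loki query
# are illustrative assumptions, not the production implementation.
import os
import time

import httpx
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

LOKI_URL = os.environ.get("LOKI_URL", "http://loki:3100")  # assumed Loki endpoint

app = App(token=os.environ["SLACK_BOT_TOKEN"])


def parse_alert(text: str):
    """Pull an app label out of the Alertmanager message (hypothetical format)."""
    for token in text.split():
        if token.startswith("app="):
            return token.split("=", 1)[1]
    return None


def query_loki_for_logs(app_label: str) -> str:
    """Ask Loki's query_range API for recent error lines of the given app."""
    now = time.time_ns()
    params = {
        "query": f'{{app="{app_label}"}} |= "error"',
        "start": now - 15 * 60 * 10**9,  # last 15 minutes, in nanoseconds
        "end": now,
        "limit": 20,
    }
    resp = httpx.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params, timeout=30.0)
    resp.raise_for_status()
    streams = resp.json()["data"]["result"]
    lines = [line for stream in streams for _, line in stream["values"]]
    return "\n".join(lines) or "No matching logs found in Loki."


@app.event("message")
def handle_message_events(event, say):
    app_label = parse_alert(event.get("text", ""))
    if app_label is None:
        return  # not an Alertmanager notification we recognize
    say(text=query_loki_for_logs(app_label), thread_ts=event["ts"])  # threaded reply


if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```

One detail that matters later: Slack Bolt runs each listener in a ThreadPoolExecutor worker thread, which is exactly where the py-spy dump below will show the bot getting stuck.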
[Issue Symptoms]
This Slack bot had been running for a long time, but earlier this month we noticed that it stopped working as expected after a couple of hours and had to be restarted frequently, so my colleague and I set out to identify the root cause.
[Journey to Identify the Root Cause]
First, we found that the Loki query code was missing a timeout, retry logic, and a check on the type of Loki's response, so we added them:
```python
# Code Snippet
import httpx
import logging
import time


def make_loki_query(request):
    # httpx.Timeout requires either a default or all four values set explicitly,
    # so a 10s default is supplied alongside the connect/read overrides.
    timeout = httpx.Timeout(10.0, connect=5.0, read=30.0)
    transport = httpx.HTTPTransport(retries=3)
    with httpx.Client(timeout=timeout, transport=transport) as client:
        for attempt in range(3):
            try:
                response = client.send(request)
                response.raise_for_status()
                try:
                    return response.json()
                except ValueError as json_err:
                    logging.error(f"❌ Failed to parse Loki JSON: {json_err}, raw response: {response.text}")
                    return None
            except httpx.RequestError as e:
                logging.error(f"🌐 Network error on attempt {attempt + 1}: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff
            except httpx.HTTPStatusError as e:
                logging.error(f"🚨 HTTP error: {e.response.status_code} - {e.response.text}")
                break
            except Exception as e:
                logging.error(f"💥 Unexpected error: {e}")
                break
        logging.error("❌ All retries failed when querying Loki.")
        return None
```
However, after applying these changes, the situation did not improve.
We then decided to attach the py-spy profiling tool to the running process, but first we needed to figure out how to integrate it into the Deployment manifest, since py-spy needs the SYS_PTRACE capability:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  ......
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: fp-bot
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2025-04-03T17:44:21Z"
      creationTimestamp: null
      labels:
        app: fp-bot
    spec:
      containers:
        - env:
            ...
            ...
          image: <......>/cloud/fp-bot:temp-fix04031515
          imagePullPolicy: IfNotPresent
          name: fp-bot
          resources: {}
          securityContext:
            allowPrivilegeEscalation: true
            capabilities:
              add:
                - SYS_PTRACE
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          workingDir: /usr/src/app
```
After the pod restarted, we diagnosed it like this:
```bash
kubectl exec -it <fp-bot-pod> -n slack-bot -- bash
ps -ef | grep 'python app.py'
py-spy dump --pid <fp-bot pid>
```
We found that many threads were stalled like this:
```
Thread 40 (idle): "ThreadPoolExecutor-0_2"
    read (ssl.py:931)
    recv (ssl.py:1056)
    read (httpcore/backends/sync.py:28)
    _receive_event (httpcore/_sync/http11.py:192)
    _receive_response_headers (httpcore/_sync/http11.py:155)
    handle_request (httpcore/_sync/http11.py:91)
    handle_request (httpcore/_sync/connection.py:90)
    handle_request (httpcore/_sync/connection_pool.py:237)
    handle_request (httpx/_transports/default.py:218)
    _send_single_request (httpx/_client.py:1009)
    _send_handling_redirects (httpx/_client.py:973)
    _send_handling_auth (httpx/_client.py:939)
    send (httpx/_client.py:912)
    make_loki_query (lokiapiclient.py:63)
    getLogFromLoki (app.py:237)
    __retry_internal (retry/api.py:33)
    retry_decorator (retry/api.py:74)
    fun (decorator.py:232)
    dealQpsEvent (app.py:71)
    __retry_internal (retry/api.py:33)
    retry_decorator (retry/api.py:74)
    fun (decorator.py:232)
    handle_message_events (app.py:281)
    run_ack_function (slack_bolt/listener/custom_listener.py:55)
    run_ack_function_asynchronously (slack_bolt/listener/thread_runner.py:124)
    run (concurrent/futures/thread.py:57)
    _worker (concurrent/futures/thread.py:80)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
```
Based on these observations, we suspected that our application was hitting Loki's query concurrency limit: once enough of the bot's listener threads were blocked waiting for Loki responses, it could no longer handle new Slack events, which matched the symptom of the bot going silent until restarted. Upon reviewing the Loki server configuration, we confirmed that the maximum query parallelism was set to 32:
```yaml
limits_config:
  max_query_parallelism: 32
```
[Solutions]
Increase max_query_parallelism from 32 to 64 or more. However, this puts extra pressure on the Loki server, so we may also need to give it more CPU and memory.
Add client-side rate limiting in the bot (a worker queue), so that it never has more Loki queries in flight than the server can serve; see the sketch below.
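As a rough illustration of the second option, a bounded semaphore can cap the number of in-flight Loki queries. This is a minimal sketch, assuming the make_loki_query wrapper from the earlier snippet; the limit value and names are placeholders, not our final implementation.

```python
import logging
import threading

# Cap in-flight Loki queries well below the server's max_query_parallelism
# (placeholder value, to be tuned against the Loki configuration).
MAX_CONCURRENT_LOKI_QUERIES = 16
_loki_slots = threading.BoundedSemaphore(MAX_CONCURRENT_LOKI_QUERIES)


def make_loki_query_limited(request, wait_seconds: float = 60.0):
    """Run make_loki_query only when a concurrency slot is free.

    If no slot frees up within wait_seconds, give up instead of blocking
    the Slack listener thread indefinitely.
    """
    if not _loki_slots.acquire(timeout=wait_seconds):
        logging.error("⏳ Gave up waiting for a free Loki query slot.")
        return None
    try:
        return make_loki_query(request)  # the retry/timeout wrapper shown earlier
    finally:
        _loki_slots.release()
```

Alternatively, queries can be pushed onto a queue.Queue consumed by a fixed number of worker threads; either way, the bot stops piling more concurrent requests onto Loki than the server allows.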
