Halyk Tech Sprints 2025: "Highload without load balancing and scaling, like coffee without caffeine"

Maxat Akbanov
9 min read

This blog post reviews the tech talk “Highload without load balancing and scaling, like coffee without caffeine,” presented by Oleg Ivakin at the Halyk Tech Sprints 2025 event in Almaty, Kazakhstan.

The talk examines what highload really means and why proper architecture, scaling, and load balancing are critical in the context of banking systems. It is a summary of observations and experience gained from managing the IT infrastructure of one of Kazakhstan's largest banks.

What Highload Really Means

One of the first myths Oleg dispelled is that highload is simply about big numbers. A million daily authorizations or a hundred million payments per month may sound impressive, but they don’t all represent the same system pressure. Numbers like these are often just marketing KPIs.

For example, generating an access token is a lightweight operation. A payment transaction, on the other hand, involves anti-fraud checks, balance verification, and counterparty validation. The complexity of these operations is what defines “load.” Highload emerges when existing resources are no longer enough to reliably process requests — and at that point, buying a bigger server is only a temporary fix.


From Monoliths to Microservices

Looking back at the evolution of architectures, Oleg reminded us that growth was never solved by stronger hardware alone.

  • The client-server era concentrated everything in one powerful machine.

  • Three-tier systems separated application logic from databases but still had limits.


  • Service-oriented architectures (SOA) distributed responsibilities further.

  • Microservices took it to the next level: independent services, their own databases, their own scaling paths.


This journey was about architectural maturity, not just more GHz or RAM. Each step was driven by the need to handle more complex and heavier traffic.


The Role of Scalability

Scalability is the ability of a system to increase performance as load grows. There are two classical approaches: vertical and horizontal.

Vertical scaling means adding more CPU, RAM, or storage to a single machine. It works for a while, but there is always a ceiling. No matter how powerful the server, at some point it can’t absorb further growth.

Horizontal scaling is the more sustainable path. Instead of one giant machine, the load is distributed across many nodes, pods, or replicas. But in practice, horizontal scaling is not just about adding workers — it requires smart architectural patterns:

  • Caching to serve frequent queries quickly, reducing pressure on the database (see the sketch after this list).

  • Read/write splitting with database replicas to offload heavy read traffic from the primary node.

  • Queues to decouple slow or complex operations from the request cycle.
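To make the caching pattern concrete, here is a minimal cache-aside sketch in Python. The TTL, the `query_database` stand-in, and the data shape are illustrative assumptions, not details from the talk; in production this role is usually played by Redis or Memcached.

```python
import time

# Hypothetical in-memory cache illustrating the cache-aside pattern.
_cache = {}          # key -> (timestamp, value)
TTL_SECONDS = 60

def query_database(user_id):
    """Stand-in for an expensive database lookup."""
    return {"user_id": user_id, "balance": 1000}

def get_user(user_id):
    entry = _cache.get(user_id)
    if entry and time.monotonic() - entry[0] < TTL_SECONDS:
        return entry[1]                      # cache hit: no database round trip
    value = query_database(user_id)          # cache miss: fall through to the DB
    _cache[user_id] = (time.monotonic(), value)
    return value
```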

The last pattern in that list, queues, is crucial. Without queues, the system works in a serialized, synchronous manner: every request must be processed fully before the connection can be closed. That ties up server resources, increases response times, and creates bottlenecks.

By introducing queues, requests can be accepted asynchronously. The system acknowledges them, places them in a buffer, and frees the connection immediately. Background workers then process tasks at their own pace. This breaks the bottleneck of serialization, makes resource usage far more efficient, and ensures that even under sudden spikes the system remains responsive.
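A minimal sketch of that decoupling, using Python's standard-library queue in place of a real broker such as RabbitMQ (the payload and timings are assumptions for illustration):

```python
import queue
import threading
import time

tasks = queue.Queue()

def accept_request(payment):
    tasks.put(payment)      # acknowledge and buffer; the connection can close now
    return "202 Accepted"

def worker():
    while True:
        payment = tasks.get()
        time.sleep(0.1)     # stand-in for anti-fraud, balance, counterparty checks
        print("processed", payment)
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()
print(accept_request({"amount": 5000, "to": "KZ123"}))
tasks.join()                # wait for the buffered work to drain
```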


In banking, these techniques are not optional. They are survival strategies. When payrolls are credited or pensions are paid out, millions of users can hit the system simultaneously. Without caching, replicas, and queues, no amount of raw compute would save the infrastructure from collapsing. Scalability, in this sense, is not about comfort — it’s about keeping the business alive during peak load.


Load Balancing as Caffeine

Scaling by itself is useless if all requests still end up hitting the same node. This is where load balancing comes in — the caffeine that keeps a highload system awake and alert. Without it, the whole idea of horizontal scaling collapses.

At the simplest level, Layer 4 load balancing distributes connections across servers. It’s fast, efficient, and doesn’t look inside the request. But modern banking systems often need more intelligence. That’s where Layer 7 balancing comes in. Operating at the application layer, it can read HTTP headers, make routing decisions based on regions or user attributes, filter suspicious requests, or even rewrite headers before passing them on.
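As a rough illustration of what Layer 7 awareness buys you, here is a toy routing function in Python. The header names, region pools, and filtering rule are hypothetical, not taken from the talk:

```python
# Toy Layer 7 routing: unlike an L4 balancer, which sees only IPs and ports,
# an L7 balancer can inspect HTTP headers before choosing a backend.
REGION_POOLS = {
    "almaty": ["10.0.1.10", "10.0.1.11"],
    "astana": ["10.0.2.10", "10.0.2.11"],
}

def route(headers):
    if "sqlmap" in headers.get("User-Agent", "").lower():
        return None                                # drop suspicious requests
    region = headers.get("X-Region", "almaty")     # route on a request attribute
    pool = REGION_POOLS.get(region, REGION_POOLS["almaty"])
    return pool[hash(headers.get("X-User-Id", "")) % len(pool)]

print(route({"X-Region": "astana", "X-User-Id": "42"}))
```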

Behind the scenes, most load balancers — whether from NGINX, HAProxy, cloud providers like AWS or Azure, or enterprise appliances — rely on the same fundamental algorithms:

  • Round robin: distributing requests in a cycle so each server takes its turn.

  • Least connections: directing traffic to the node with the lightest load.

  • Resource-aware balancing: ensuring no server is overloaded while others remain idle.
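Toy versions of the first two algorithms, sketched in Python (production balancers such as NGINX and HAProxy implement these natively and far more robustly):

```python
import itertools

servers = ["app-1", "app-2", "app-3"]

# Round robin: each server takes its turn in a fixed cycle.
_cycle = itertools.cycle(servers)
def round_robin():
    return next(_cycle)

# Least connections: pick whichever node currently has the lightest load.
active = {s: 0 for s in servers}
def least_connections():
    target = min(active, key=active.get)
    active[target] += 1       # decrement again when the request finishes
    return target
```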

Load balancers are also where rate limiting is enforced. Returning an HTTP 429 “Too Many Requests” may seem unfriendly, but it prevents backend systems from collapsing under excessive demand or even attack.
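One common way to implement that limit is a token bucket. The sketch below is a single-process Python illustration with made-up numbers; real balancers enforce this per client and often in shared state:

```python
import time

RATE = 100.0     # tokens refilled per second (illustrative)
BURST = 200.0    # maximum bucket size (illustrative)
_tokens, _last = BURST, time.monotonic()

def allow_request():
    global _tokens, _last
    now = time.monotonic()
    _tokens = min(BURST, _tokens + (now - _last) * RATE)   # refill since last call
    _last = now
    if _tokens >= 1.0:
        _tokens -= 1.0
        return True
    return False     # caller should respond with HTTP 429 Too Many Requests
```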


In effect, load balancing is not just about performance — it’s a form of protective shielding for your services.

In the bank’s architecture, load balancing sits at multiple layers: from the external edge filtering traffic before it hits the system, to internal HAProxy clusters splitting read and write queries across database replicas, to Kubernetes ingresses handling service-to-service traffic. Everywhere you look, the same principle applies: distribute the load, smooth out the spikes, and protect the weakest components from being overwhelmed.

Highload without balancing is like coffee without caffeine. You can still call it coffee, but it won’t give you the energy you need when things get serious.


Networking at Banking Scale

At the networking layer, Oleg highlighted how the bank ties together multiple geographically separate data centers into a single logical environment. Instead of treating each data center as an isolated island, they use Cisco's proprietary ACI fabric, built on VXLAN, to unify them.

VXLAN (Virtual Extensible LAN) is a network overlay technology. It encapsulates traditional Layer 2 traffic into Layer 3 packets, effectively stretching a LAN across different physical locations.

An excellent explanation of VXLAN can be found here: What does VXLAN do?

In simpler terms, it allows servers and applications running in separate data centers to behave as if they were connected to the same local network — even though they may be hundreds of kilometers apart.
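To make the encapsulation idea tangible, here is a sketch of building the 8-byte VXLAN header from RFC 7348 in Python; the VNI value and inner frame are placeholders. The encapsulated frame then travels inside an ordinary UDP/IP packet (UDP port 4789):

```python
import struct

def vxlan_encapsulate(vni, inner_ethernet_frame):
    """Prefix an Ethernet frame with a VXLAN header (RFC 7348 layout)."""
    assert 0 <= vni < 2**24                 # the VNI is a 24-bit segment ID
    flags = 0x08                            # I flag: the VNI field is valid
    header = struct.pack("!B3xI", flags, vni << 8)  # VNI in the top 24 bits
    return header + inner_ethernet_frame

packet = vxlan_encapsulate(vni=5001, inner_ethernet_frame=b"\x00" * 64)
print(len(packet))   # 8-byte header + 64-byte frame = 72
```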

Cisco’s ACI implementation of VXLAN fabric adds centralized policy control, automation, and high redundancy. The result is a consistent, software-defined “fabric” that spans multiple data centers.

Within this fabric:

  • Control planes manage the overall state and policies.

  • Ingress points (leaf switches) accept and route incoming traffic.

  • Spine switches provide high-bandwidth connectivity across the entire mesh.

  • End nodes — virtual machines, databases, firewalls, routers — connect redundantly to ensure both load sharing and fault tolerance.

The analogy Oleg drew is striking: at this scale, the fabric begins to resemble a Kubernetes cluster. Controllers act like the Kubernetes control plane, ingresses handle external traffic, and workers (servers and network devices) process requests. Only here, the “pods” and “nodes” are physical and virtualized infrastructure rather than containers.

This design ensures that high availability, load distribution, and scalability principles aren’t just applied at the application layer, but are baked into the very foundation of the bank’s infrastructure.


Monitoring as the First Line of Defense

Even the most carefully scaled and balanced architecture will fail without monitoring. Metrics such as request volume, memory usage (to catch leaks early), and bandwidth saturation must be tracked continuously.

The rule of thumb is to never run systems at 100%. Once resources are fully consumed, services stall, users leave, and panic spreads internally.

Instead, infrastructure should aim to stay below 80% utilization, leaving room for spikes. Thresholds and alerts must reflect this buffer.
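Expressed as a trivial check (the 80% figure comes from the talk; the metric names and values are placeholders):

```python
WARN_THRESHOLD = 0.80   # alert while headroom still exists, not at 100%

def check_utilization(name, used, capacity):
    ratio = used / capacity
    if ratio >= WARN_THRESHOLD:
        print(f"ALERT: {name} at {ratio:.0%}; scale out before it saturates")

check_utilization("cpu", used=85.0, capacity=100.0)
```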


A Real-World Example: Ticket Sales

To illustrate, Oleg showed the architecture of a ticket sales platform — a service that faced huge bursts of demand in Kazakhstan: Kino.kz

At the entry point sat a WAF and DDoS protection layer. Behind it, a load balancer (NGINX) distributed traffic across clusters. PostgreSQL, managed by Patroni for high availability, and MongoDB served as the databases, while HAProxy handled read/write splitting. Kubernetes separated control, worker, and ingress layers; Redis cached repeated queries, and RabbitMQ queued asynchronous work. Harbor stored the application container images, and code was built and deployed via GitLab CI.
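A back-of-the-envelope sketch of the read/write split HAProxy performs here, in Python; the hostnames and the SELECT heuristic are illustrative assumptions:

```python
import itertools

PRIMARY = "pg-primary:5432"
_replicas = itertools.cycle(["pg-replica-1:5432", "pg-replica-2:5432"])

def pick_backend(sql):
    """Send reads to replicas, everything else to the primary."""
    is_read = sql.lstrip().upper().startswith("SELECT")
    return next(_replicas) if is_read else PRIMARY

print(pick_backend("SELECT * FROM tickets"))      # a replica
print(pick_backend("UPDATE tickets SET status = 'sold'"))  # the primary
```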

Tech Stack Used in Kino.kz

| Layer / Function | Technology | Purpose |
| --- | --- | --- |
| Security / Entry | WAF + Anti-DDoS service | Protects against malicious traffic and denial-of-service attacks. |
| Load Balancing | NGINX, HAProxy | NGINX for general traffic distribution; HAProxy for database read/write splitting. |
| Databases | PostgreSQL (Patroni), MongoDB | PostgreSQL with Patroni for high availability; MongoDB for unstructured data. |
| Caching / Messaging | Redis, RabbitMQ | Redis for fast in-memory caching; RabbitMQ for async messaging/queues. |
| Orchestration / Containers | Kubernetes (control, worker, ingress layers) | Manages containerized workloads, provides elasticity and scaling. |
| Container Registry | Harbor | Stores and manages Docker images for applications. |
| CI/CD Pipeline | GitLab CI | Automates build, test, and deployment of services. |

This design avoided the trap of buying one oversized server and instead distributed load intelligently across specialized components.


Kubernetes and the Legacy Reality

The bank runs multiple Kubernetes flavors — vanilla, OpenShift, OKD. For services with constant transactional load, Kubernetes is a natural fit. It allows fast horizontal scaling when traffic spikes.
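The scaling decision itself follows the same proportional logic as Kubernetes' Horizontal Pod Autoscaler; the sketch below mirrors that formula with assumed numbers:

```python
import math

def desired_replicas(current, observed_cpu, target_cpu=0.7, max_replicas=20):
    """HPA-style rule: grow replicas in proportion to observed vs. target load."""
    return min(max_replicas, max(1, math.ceil(current * observed_cpu / target_cpu)))

print(desired_replicas(current=4, observed_cpu=0.95))   # -> 6
```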

But Oleg was clear: not every system belongs in Kubernetes. Legacy banking platforms, especially card processing, remain monolithic and harder to scale. Internal applications with sporadic activity can run fine on simpler stacks. A hybrid world is the reality for banks with decades of history.


Security and DDoS

For the largest bank in the country, DDoS attacks are a fact of life. Protection is outsourced to specialized partners who scrub malicious traffic before it reaches the bank’s systems. This is another reminder that scalability is not just about performance, but also about resilience against deliberate overload.


Final Thoughts

Oleg’s core message was simple: don’t fall into the trap of chasing bigger servers. Highload is about designing the right architecture, forecasting where the pressure points will be, and applying balancing and scaling techniques wisely.

Load balancing and scalability are the caffeine of highload systems. Without them, services collapse into 503 errors, and both customers and engineers end up in panic. With them, even peak traffic can be absorbed smoothly.

For DevOps engineers, this talk reinforced timeless lessons: build for horizontal scaling, monitor relentlessly, expect DDoS, and respect the legacy systems that can’t be containerized away. In the end, highload isn’t a number — it’s the discipline of making complex operations reliable under pressure.

