Why there is a need for a Service Mesh

Consider a system with multiple services, such as:

Auth Service
Products Service
Shopping Cart Service
Order Service
Payment Service
Rating Service
Supplier Service

All of these are microservices that communicate with each other. Managing connectivity between services is one of the biggest challenges in a microservices architecture.

Let’s assume that our microservices are deployed in a Kubernetes cluster on EKS in one of the US regions, and that we now need to deploy the services in a Europe region as well. This adds another layer of complexity, because we have to manage communication between services across regions securely. Let’s also assume that there are multiple databases managed centrally by database engineers on separate VMs. These databases are shared by the entire organization, which makes them a legacy system that would be an enormous task to move to a Kubernetes environment. The microservice application therefore has to communicate with database services running on virtual machines in on-premise data centers, which makes this a hybrid environment. Managing communication in a hybrid environment is even more complex than managing it within a homogeneous one.

Now let’s say that one day AWS has a major outage across most regions, leaving the application unresponsive and causing losses. To guard against this kind of situation, a multi-cloud environment can be used for failover to another cloud provider (the entire Kubernetes application is replicated on GCP or Azure). This works because the chance of two different cloud providers going down in the same regions at the same time is very low. However, it again introduces many challenges in terms of secure communication, network configuration, and so on.

Service mesh technologies such as Consul act as the communication network layer between microservices and solve many of the challenges mentioned above.

What is a Service Mesh

A service mesh like Consul is a dedicated infrastructure layer with features that address the challenges of operating microservices and enabling communication between them across multiple environments.

How would it work without Consul?

Let's consider a city with different buildings and roads, where the apartment buildings have a number of residents. Each resident does their own tasks, and sometimes they need information from other residents to complete them, so they send messages to each other. Each resident has their own registry book listing all the other residents and their addresses (where messages can be sent).

Now, relate the entities in the following manner:

The city is similar to a Kubernetes Cluster
Buildings are Nodes
Apartments are Pods of each microservice
Residents are service containers
The registry book is like the config file for the applications in the container (which basically has service names and port numbers)

NOTE: The following describes the situation where we don't have a service mesh such as Consul.

If a resident moves to another building, all the other residents would have to update their registry book with the updated address of this resident (otherwise they would be sending the messages to the wrong address).

This would typically happen when a microservice or a database service gets a new endpoint or the service name/port changes.
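As a rough illustration of the registry book problem, the sketch below gives two services their own static copy of the endpoint list. The service names, addresses, and the `resolve` helper are hypothetical and exist only to show how one stale copy leads to messages going to the wrong address.

```python
# Hypothetical static "registry book" baked into each service's config.
# Every service carries its own copy, so a moved endpoint has to be
# corrected in each copy by hand.
CART_SERVICE_CONFIG = {
    "products": ("10.0.1.12", 8080),
    "payment": ("10.0.2.7", 9090),
}

ORDER_SERVICE_CONFIG = {
    "products": ("10.0.1.12", 8080),
    "payment": ("10.0.2.7", 9090),
}


def resolve(config, service_name):
    """Look up a peer's address in one service's private copy."""
    host, port = config[service_name]
    return f"{host}:{port}"


# The payment service gets a new address, but only the cart service's
# copy is updated.
CART_SERVICE_CONFIG["payment"] = ("10.0.3.44", 9090)

fresh = resolve(CART_SERVICE_CONFIG, "payment")
stale = resolve(ORDER_SERVICE_CONFIG, "payment")
```

Here the order service keeps sending to the old address until someone edits its config as well.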

The city is also managed centrally (like someone administering the Kubernetes cluster). The administrators want data about all communication between the services to be transparent and gathered in one place, so they can identify communication issues between the residents and fix them.

Therefore, the residents would have to keep a log and report their communications to the city’s monitoring service. This has to be done alongside each resident's regular tasks, which can become overwhelming for many residents. It is equivalent to adding a monitoring endpoint to each service to expose metrics. It also has limitations: some residents may miss parts of the data, write in different formats, or write illegibly, leaving the central management unable to read the metadata about the communication.

How would Consul optimize this situation?

To optimize the city, take some workload off the residents, and let them focus on their main task, we as service administrators introduce a service mesh like Consul. In every apartment, each resident gets a personal assistant. The assistant tells the resident: “I will send all those messages for you, and you don’t need to know the exact address of the residents you’re talking to. Just write their name on the envelope; I will find out where they live and deliver the message. When they reply, I will receive the incoming message and forward it to you. I will also keep a log of any communication that goes through me.” So, all the administrative tasks around sending and receiving messages are handled by the assistants, and the residents can focus on their main activity and the actual content of the message.

In Consul, these assistants are Envoy Sidecar Proxy Containers. Proxy containers are injected into each pod. Sidecar containers are helper containers that run alongside the main container in a pod.

How do these assistants do their job?

Instead of each resident having their own address book, the assistants share one address book on a shared network. Each assistant adds information about its resident (its service) and how to reach it to this central registry, and other assistants can read from it. When a new pod is scheduled with a new microservice, the service mesh automatically injects a proxy, so the pod gets an assistant, and this new assistant updates the registry on the shared network.

When a resident wants to deliver a message to another resident, it simply asks its assistant to deliver the message to <resident_name> (the service name). The assistant (proxy) finds the location of the service in the shared registry and sends the message to that address, where the assistant (proxy) of the receiving service opens the door, accepts the message, and delivers it to the actual resident (microservice) inside the apartment (pod). Essentially, services don’t need to know each other’s endpoints; this information is abstracted by the service mesh. This is called a Dynamic Service Registry: endpoints are added and updated automatically rather than by hand.
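The shared address book can be sketched as a toy model. The `ServiceRegistry` and `SidecarProxy` classes below are hypothetical stand-ins, not Consul's API; they only show that the sending service supplies a name and the proxy resolves the address.

```python
class ServiceRegistry:
    """Toy stand-in for the shared address book (not Consul's real API)."""

    def __init__(self):
        self._addresses = {}

    def register(self, service_name, address):
        self._addresses[service_name] = address

    def lookup(self, service_name):
        return self._addresses[service_name]


class SidecarProxy:
    """Toy assistant: registers its own service and delivers by name."""

    def __init__(self, registry, service_name, address):
        self.registry = registry
        self.service_name = service_name
        # A new proxy announces its service as soon as it starts.
        registry.register(service_name, address)

    def send(self, target_name, message):
        # The application supplies only a name; the proxy resolves it.
        address = self.registry.lookup(target_name)
        return f"{self.service_name} -> {target_name}@{address}: {message}"


registry = ServiceRegistry()
cart = SidecarProxy(registry, "shopping-cart", "10.0.1.5:8080")
payment = SidecarProxy(registry, "payment", "10.0.2.7:9090")

first = cart.send("payment", "charge order 17")

# The payment pod is rescheduled; its new proxy re-registers, and the
# cart service needs no config change.
payment = SidecarProxy(registry, "payment", "10.0.3.44:9090")
second = cart.send("payment", "charge order 18")
```

The cart service calls `send("payment", ...)` both times; only the registry entry changed in between.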

Now, let's say a resident gets sick. In this case, its assistant updates the shared registry to record that the resident is sick and unable to send or receive messages for now. Once the resident is healthy again, the assistant updates the registry again. This means that health status is available in real time in the registry.
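Health tracking can be added to the same idea. This is a minimal sketch with a hypothetical `HealthAwareRegistry` class and made-up addresses; it assumes callers should only ever be handed instances currently marked healthy.

```python
class HealthAwareRegistry:
    """Toy registry that tracks a health flag per service instance."""

    def __init__(self):
        self._instances = {}  # service name -> {address: is_healthy}

    def register(self, service_name, address):
        self._instances.setdefault(service_name, {})[address] = True

    def set_health(self, service_name, address, healthy):
        self._instances[service_name][address] = healthy

    def healthy_addresses(self, service_name):
        instances = self._instances.get(service_name, {})
        return sorted(addr for addr, ok in instances.items() if ok)


registry = HealthAwareRegistry()
registry.register("rating", "10.0.1.20:7000")
registry.register("rating", "10.0.1.21:7000")

# The assistant of the first instance reports that its resident is sick.
registry.set_health("rating", "10.0.1.20:7000", False)
while_sick = registry.healthy_addresses("rating")

# The resident recovers and the assistant updates the registry again.
registry.set_health("rating", "10.0.1.20:7000", True)
after_recovery = registry.healthy_addresses("rating")
```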

Secure Networking - mTLS

While the assistants are carrying messages back and forth between services in different buildings, let’s assume a hacker gets access to the city (Kubernetes cluster). The hacker roams the streets freely and wants to sniff the messages being sent between residents, especially those containing private, sensitive data. If the messages are not encrypted, they can be sniffed easily and the information is compromised. This is why the messages are encrypted: even if they are intercepted, only the assistants can decrypt them with their keys.

This is another feature that the service mesh offers. Encrypting messages between proxies using mutual TLS (mTLS) ensures that all service-to-service communication is encrypted and that both sides authenticate each other. If we did not use a service mesh, we would have to implement this logic in every service, manage the keys manually, and take on all the related overhead.
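To give a sense of the work a mesh takes off each service, here is a minimal sketch of the TLS setup a service would otherwise need on both ends of a connection, using Python's standard `ssl` module. The function names and the optional file-path parameters are hypothetical. In a mesh, the sidecar proxy does the equivalent and the application code stays unchanged.

```python
import ssl


def server_context(cert_file=None, key_file=None, ca_file=None):
    """Context for the receiving side that also demands a client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH, cafile=ca_file)
    # Requiring a certificate from the caller is the "mutual" part of mTLS.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def client_context(cert_file=None, key_file=None, ca_file=None):
    """Context for the calling side: verifies the server, presents its own cert."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Every service would also need its certificates issued, distributed, and rotated, which is the key management overhead mentioned above.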

Authenticated Communication

Each service proxy gets its own certificate to establish secure connections with other services. This certificate can also be used to uniquely identify the service and validate its identity. Think of each resident as having a unique stamp: when they send a message, the assistant puts the stamp on the envelope and encrypts it. When the receiving assistant gets the message, it can verify the authenticity of the stamp with the central registry and infer which service the message came from.

Control Traffic With Service Intentions

We can define rules about which services are allowed to talk to which other services. In Consul, these rules are called intentions. Once the proxy has verified the caller's identity, it checks whether the communication is permitted. This is called Micro Network Segmentation (the practice of dividing a network into multiple segments or subnets that act as individual networks).
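The rule check can be sketched as a lookup over allow/deny entries. The rule table, the `*` wildcard handling, and the default-deny behaviour below are simplifying assumptions for illustration, not Consul's actual rule engine.

```python
# Hypothetical allow/deny rules keyed by (source, destination).
# "*" is a wildcard; this is a simplified model.
INTENTIONS = {
    ("shopping-cart", "payment"): "allow",
    ("order", "payment"): "allow",
    ("*", "payment"): "deny",
}


def is_allowed(source, destination, default="deny"):
    """Return True if the verified source identity may call destination."""
    # Prefer the most specific rule: exact match first, then wildcards.
    candidates = (
        (source, destination),
        ("*", destination),
        (source, "*"),
        ("*", "*"),
    )
    for key in candidates:
        action = INTENTIONS.get(key)
        if action is not None:
            return action == "allow"
    return default == "allow"
```

With these rules, `is_allowed("shopping-cart", "payment")` is permitted, while a call from `rating` to `payment` falls through to the wildcard deny.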

Observability

Once we have the communication rules in place, we can see which services are breaking the rules and trying to communicate with services they are not supposed to communicate with. We can also monitor which services are unhealthy, view traffic statistics for different services, check error rates, and identify bottlenecks in the system. Since the proxies run in the same pods as the services, they have rich telemetry data, which can be exposed to external monitoring services like Datadog or Prometheus. And since the proxies are all Envoy proxies, they report data in a uniform format across services, which makes it easy to combine the metrics from all services and build unified dashboards in Prometheus and Grafana. The service mesh therefore lets us control what happens in the network without interfering with the applications.
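Because every proxy reports in the same shape, aggregation becomes simple. The sketch below assumes a hypothetical record format of (source, destination, HTTP status) and computes a per-service error rate; real Envoy metrics are richer than this.

```python
from collections import defaultdict

# Hypothetical request records as every proxy might report them, all in
# the same shape: (source, destination, HTTP status code).
RECORDS = [
    ("shopping-cart", "payment", 200),
    ("shopping-cart", "payment", 503),
    ("order", "payment", 200),
    ("order", "products", 200),
    ("rating", "products", 500),
    ("order", "payment", 200),
]


def error_rates(records):
    """Fraction of 5xx responses per destination service."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for _source, destination, status in records:
        totals[destination] += 1
        if status >= 500:
            errors[destination] += 1
    return {dest: errors[dest] / totals[dest] for dest in totals}


rates = error_rates(RECORDS)
```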

Consul Servers

Imagine all of the assistants in the city work for the same company, which has a headquarters office in its own building, separate from the assistants. This office is the Consul server. We can have a single room in the building (a single Consul server instance, or a single pod replica), but if we’re managing many services and their proxies, we need a bigger office (perhaps 3-5 Consul server pods). These Consul servers push out all the necessary data to the proxies (or Consul clients), such as service registry information, communication rules, configuration, and certificates.
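One reason for running 3 or 5 servers rather than 1 is fault tolerance. Consul servers replicate their state with the Raft consensus protocol, which needs a majority of servers to agree. The arithmetic can be sketched as follows; the helper names are hypothetical.

```python
def quorum(servers):
    """Votes needed for a majority in a cluster of `servers` members."""
    return servers // 2 + 1


def failures_tolerated(servers):
    """Servers that can be lost while a majority is still reachable."""
    return servers - quorum(servers)


# server count -> (quorum size, failures tolerated)
table = {n: (quorum(n), failures_tolerated(n)) for n in (1, 3, 5)}
```

A single server tolerates no failures, three servers tolerate one, and five tolerate two.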

Control Plane and Data Plane

The proxies form a network over which they talk to each other and exchange information; this network of proxies is called the Data Plane. The cluster of Consul servers that manages the proxy network is called the Control Plane. The Data Plane is managed centrally by the Consul Control Plane. The Data Plane is focused only on communication; everything else (services being added, updated, or removed, certificates being rotated, and so on) is conveyed to the Data Plane by the Control Plane. The Control Plane lives separately (either in the same Kubernetes cluster or in a different one).
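The division of labour can be sketched as a toy model in which the control plane owns the desired state and the proxies only receive it. The `Proxy` and `ControlPlane` classes and the config keys are hypothetical; the real mechanism between Consul and Envoy is more involved than a direct method call.

```python
class Proxy:
    """Toy data-plane member: forwards traffic using config it was given."""

    def __init__(self, service_name):
        self.service_name = service_name
        self.config = {}

    def apply(self, config):
        # The proxy never computes config itself; it only receives it.
        self.config = dict(config)


class ControlPlane:
    """Toy control plane: owns the desired state and pushes it out."""

    def __init__(self):
        self._proxies = []
        self._config = {"registry": {}, "cert_version": 1}

    def attach(self, proxy):
        self._proxies.append(proxy)
        proxy.apply(self._config)

    def update(self, **changes):
        self._config.update(changes)
        for proxy in self._proxies:
            proxy.apply(self._config)


control_plane = ControlPlane()
cart, payment = Proxy("shopping-cart"), Proxy("payment")
control_plane.attach(cart)
control_plane.attach(payment)

# A certificate rotation is decided centrally and reaches every proxy.
control_plane.update(cert_version=2)
```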

Multi-Cluster

Let’s consider a scenario with multiple Kubernetes clusters containing our microservices (perhaps in different geographic regions or on different cloud platforms). In such cases, we can either run a Consul control plane in each cluster, or have one dedicated cluster whose Control Plane manages all the other Consul data planes. Generally, the second approach (having a separate dedicated cluster for the Control Plane) is more common, since it avoids redundant resource allocation and configuration.

Consul is very powerful in such multi-cluster, multi-datacenter environments. Connecting services across different environments can be a really big networking and security challenge without a service mesh tool.

Think of Consul as placing guards at the entry and exit of the city (Kubernetes cluster). Such a guard is called a mesh gateway. Mesh gateways enable service mesh traffic to be routed securely between different Consul datacenters in different clouds.
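The role of the guards can be sketched as a routing decision made from inside one datacenter. The datacenter names, gateway addresses, and the `route` helper below are hypothetical, and the sketch assumes traffic leaves through the local gateway and enters through the remote one.

```python
# Hypothetical view from a proxy in datacenter "us": local services are
# reached directly, remote ones through the mesh gateways.
LOCAL_DATACENTER = "us"
LOCAL_SERVICES = {"payment": "10.0.2.7:9090"}
MESH_GATEWAYS = {"us": "gw.us.example:8443", "eu": "gw.eu.example:8443"}


def route(service_name, target_datacenter):
    """Return the ordered hops a request takes to reach the service."""
    if target_datacenter == LOCAL_DATACENTER:
        return [LOCAL_SERVICES[service_name]]
    # Cross-datacenter traffic exits via the local gateway and enters
    # via the remote one, which forwards it to the target service.
    return [
        MESH_GATEWAYS[LOCAL_DATACENTER],
        MESH_GATEWAYS[target_datacenter],
        f"{service_name}.{target_datacenter}",
    ]


local_path = route("payment", "us")
remote_path = route("rating", "eu")
```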

How Consul works with VMs

Continuing the city analogy, an on-premise data center is a city of private houses, where each house is a VM. The application or service is the only resident in the house, and the Consul proxy lives in that house too. There is a dedicated house for the Consul server. We can then configure a communication channel (with a mesh gateway on a VM) that the Consul server running on the VM uses to connect with the Consul server running in the Kubernetes cluster.

Introduction to HashiCorp Consul