Securing GKE Workloads with Service Mesh

Tim Berry

This is the eighth post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to start there.

In the last two posts we looked at how Cloud Service Mesh can help us with things like multi-cluster networking and fleet ingress. In this final part of our tour of the mesh, we’ll look at the additional security features it provides, which can be one of the most compelling reasons to implement a mesh in the first place. Historically, many security design decisions ended at the boundary of a single server or a single monolithic application. This was also true for networking components, where entities within a specific network boundary (such as a firewall or VPN) were trusted and not subject to further rigorous security measures.

Cloud native designs change these approaches completely. Organizations including Google, the Cloud Native Computing Foundation and even the US Department of Defense all recommend zero-trust approaches to security, where service identity must always be proven and permissions granted on a least-privilege basis. This is especially important when migrating monolithic applications to microservices, as doing so increases the number of moving parts and thus the attack surface. Individual services may be compromised and used to impersonate other services on a network, either disrupting applications or exposing sensitive information.

So, we need a way to make sure that all service interactions are authenticated, authorized and controlled. Luckily, this is something the Service Mesh excels at. In this post we’re going to cover the following main topics:

  • Implementing mutual transport layer security to encrypt all traffic

  • Adding service account identities to our workloads

  • Controlling which services and namespaces can access each other

As we’ve covered the bulk of how Service Mesh works in previous posts, this will be a lighter tour through these topics: they are well integrated into Service Mesh and easy to implement. At the end of the post, we’ll summarize what we’ve learned about Service Mesh, and use these topics as a springboard into further security concepts.

Service identity and end-to-end encryption

Let’s start by discussing why securing individual workloads such as microservices is so important. Many people assume that enforcing security up to the boundary of their cloud infrastructure is sufficient. Perhaps you’re terminating TLS connections at your load balancer, or even using a Web Application Firewall to provide additional protections. Once the traffic is inside your network, you may be assuming it is safe. But as our application designs become more abstract and granular, we introduce more complexity, which in turn leads to a wider attack surface. If a bad actor can infiltrate some vulnerable code in one part of your application stack, what’s to stop them impersonating any other part to steal or damage your data?

A Service Mesh allows us to secure an environment composed of multiple microservices by adding a secure identity to each of our workloads. In this scenario, a single workload may be a Pod, or a collection of replica Pods managed by some kind of controller such as a Deployment or StatefulSet. The workload may represent some part of an overall system, such as the checkout, cart or frontend components we saw in the Online Boutique demo earlier in this series of posts. In a typical cluster, service discovery means that any of these workloads can find and connect to any other workload. Typically, this internal traffic is not encrypted either. That means that if just one component becomes compromised, an attacker may be able to intercept data intended for other components.

To illustrate this, let’s imagine that our frontend workload has been compromised. Because it shares the same network address space as other workloads, it could potentially intercept their traffic. Attack code could even be run on this frontend to impersonate a checkout service and infiltrate the payment service backend.

The first part of remediating this is to encrypt the data sent between individual microservices. Historically this would have introduced a significant management burden: creating, assigning and distributing TLS certificates to each microservice, and making sure that each service trusts the certificate chain used to issue them. Now, however, Service Mesh can automate this entirely: we simply enable a policy and leverage Google’s Mesh CA service. We’ll see how to do this in a moment.

The second part of our approach is to be explicit about which services should be allowed to communicate with which other services, and in what ways. Now that each of your microservices presents its own client certificate, they all have a unique, provable identity. That means we can create rules to lock down communication between microservices to only the interactions that we know should be happening, thereby reducing the attack surface for bad actors.

Best of all, the implementation of these new security endeavors is all handled by the Envoy sidecar container, so in most cases no changes to actual application containers are required. Let’s look at how we set these features up in our clusters.

Enabling secure two-way communication

Mutual TLS (or mTLS) is implemented by Istio in Cloud Service Mesh using the Envoy proxy sidecar containers running alongside your workloads. mTLS has two operating modes:

  • Permissive: Sidecar proxies will use TLS to connect with other sidecar proxies but will also allow non-TLS communication for incoming connections or connections to other workloads without sidecars.

  • Strict: Only TLS connections will be allowed.

Cloud Service Mesh enables permissive mTLS across your cluster out of the box. However, it’s recommended to switch to strict mode to secure your mesh. This configuration is done with a PeerAuthentication object in the istio-system namespace. Istio considers this namespace the root of your mesh and will apply the configuration to every namespace where Istio injection is enabled.

To secure your entire mesh, you could apply the object with the following YAML:

apiVersion: "security.istio.io/v1beta1"
kind: "PeerAuthentication"
metadata:
  name: "mesh-wide-mtls"
  namespace: "istio-system"
spec:
  mtls:
    mode: STRICT

Alternatively, you could use the above YAML to apply a strict mTLS policy to a specific namespace by simply changing the namespace referenced, or by removing it entirely from the YAML and supplying it dynamically with kubectl -n, which can be useful for scripting, as sketched below.
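For example, here’s a minimal sketch of the same policy with the namespace omitted from the metadata; the demo namespace and the filename are just assumptions for illustration:

# Apply to a specific namespace at deploy time with:
#   kubectl apply -n demo -f namespace-mtls.yaml
apiVersion: "security.istio.io/v1beta1"
kind: "PeerAuthentication"
metadata:
  name: "namespace-mtls"
spec:
  mtls:
    mode: STRICT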

In some scenarios, we may want to enable strict mTLS only for specific workloads rather than an entire namespace. We can do this by adding selectors to the PeerAuthentication object so that it only targets specific Pods.

For example, if our Pods have the metadata label app=frontend, we could use the following YAML definition:

apiVersion: "security.istio.io/v1beta1"
kind: "PeerAuthentication"
metadata:
  name: "frontend-mtls"
  namespace: "frontend"
spec:
  selector:
    matchLabels:
      app: frontend
  mtls:
    mode: STRICT

However, Cloud Service Mesh can’t aggregate workload-level policies for outbound mTLS traffic to a service, so we also need to add a matching DestinationRule:

apiVersion: "networking.istio.io/v1alpha3"
kind: "DestinationRule"
metadata:
  name: "frontend-dr-mtls"
spec:
  host: "frontend.demo.svc.cluster.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

With these objects in place, incoming requests to the frontend service will now require TLS. We’ll need to be mindful of what incoming connections we expect, as unencrypted requests will now be rejected. Usually, we rely on an Ingress gateway to accept a request from an external user (either directly or via a Cloud Load Balancer). The Ingress gateway Pods would then use mTLS to communicate securely with the frontend.

Using policies to authorize connections

As we’ve already discussed, encrypting network traffic between workloads is just one part of the solution. However, by enforcing the use of certificates for mTLS, we can now also trust that the originator of a request is who they say they are. For example, a proxy accepting requests from the frontend workload over TLS can trust that the request really is from frontend because of the certificate being used.

Once we’ve authenticated a request with its certificate, Cloud Service Mesh also allows us to create rules regarding authorization. In other words, we trust the identity of this workload, but do we want to allow them to make the connection?

In a microservices architecture, authorization policies can be useful to control which workloads are allowed to communicate with which other workloads. Remember, in the ephemeral world of Kubernetes, our Pods will have constantly changing IP addresses, so traditional firewall rules simply no longer make sense. Instead, we can use an AuthorizationPolicy to define a set of rules for traffic that we will permit, based on either a workload or namespace identity.

The structure of an AuthorizationPolicy

The AuthorizationPolicy object is a custom resource which, when applied to our cluster, will create rules on all of the affected sidecar proxies in scope. An individual policy object can target the entire mesh, a namespace or an individual workload using selectors in a similar fashion to the PeerAuthentication object we saw earlier.

Policies are composed of an action and optionally (although usually!) some rules:

  • The action is usually set to ALLOW or DENY, determining what happens to matching requests. As we’ll see in a moment, there’s a slightly counter-intuitive way to build sets of rules using multiple ALLOWs without any DENYs for most use cases. In advanced use cases, the action can be CUSTOM, which allows you to delegate access control to an external authorization system. Finally, you can also specify an action of AUDIT, which causes the request to simply be logged and has no bearing on whether the connection is allowed or not (see the sketch after this list). Normally, audit rules are applied in addition to allow and deny rules to assist with troubleshooting.

  • The rules define which requests will match and should be affected by the action. Within the object’s rules, we can specify the source of the request in the from section. The to section allows us to specify which operations should be permitted (such as HTTP GET, POST etc.). It may feel like the to section should specify the target of our policy, but don’t forget that we specify the target with a selector if we’re aiming at a specific workload. Finally, we can apply some additional conditions in a when field, including request headers that must match for the policy to apply.
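To illustrate the AUDIT action mentioned above, here’s a minimal sketch that would log (but neither allow nor block) POST requests to Pods labelled app=products; the selector and operation here are assumptions for the example:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: audit-posts
  namespace: default
spec:
  selector:
    matchLabels:
      app: products
  action: AUDIT
  rules:
  - to:
    - operation:
        methods: ["POST"]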

Let’s see some example AuthorizationPolicy objects to illustrate how these conventions work. Here’s a basic policy to allow all requests to Pods that match the selector app=frontend:

apiVersion: "security.istio.io/v1beta1"
kind: "AuthorizationPolicy"
metadata:
  name: "frontend-ap"
  namespace: frontend
spec:
  selector:
    matchLabels:
      app: frontend
  action: ALLOW

This is obviously a very permissive policy, so we could consider locking it down a bit by adding some approved operations. In the updated version below, we’ll specify that we only accept HTTP GET requests for paths that start with /public or /test:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: tester
  namespace: default
spec:
  selector:
    matchLabels:
      app: products
  action: ALLOW
  rules:
  - to:
    - operation:
        methods: ["GET"]
        paths: ["/public/*", "/test/*"]

Matching keywords like paths can also be used with their negative condition versions, which in this example would be notPaths. Including a negative condition means the policy applies as normal, but not to the paths specified in the notPaths parameter, as sketched below.
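For instance, here’s a sketch of the previous policy rewritten with a negative condition, assuming we want to allow GET requests to any path except those under /admin:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: tester
  namespace: default
spec:
  selector:
    matchLabels:
      app: products
  action: ALLOW
  rules:
  - to:
    - operation:
        methods: ["GET"]
        notPaths: ["/admin/*"]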

For a full list of supported operations, see https://istio.io/latest/docs/reference/config/security/authorization-policy/#Operation

Using service accounts with mesh identities

A common pattern is to define policies that grant access based on the identity of the caller. As we’ve already mentioned, Cloud Service Mesh provides each service with a secure identity, so we know who is calling. We can then create policies that only allow requests from the identified callers we choose. However, we don’t reference the details of a TLS certificate directly in our AuthorizationPolicy objects; instead, we reference a Service Account.

Service Accounts in Kubernetes give a distinct identity to a workload and work with Role Based Access Control to provide a way to control which objects a workload may access. You’re probably already using them in your own Kubernetes projects; if you’re not, I definitely recommend looking them up in the Kubernetes documentation! So how do they relate to AuthorizationPolicies?

Let’s imagine a typical scenario where we have two workloads, frontend and backend, and we want to allow requests to the backend service only if they come from the frontend workload. In other words, we shouldn’t allow direct connections to backend from anywhere else.

Here’s some example YAML we could use to create a unique service account identity for the frontend, and make sure it’s being used in the frontend deployment:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: frontend-sa
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: frontend-image
      serviceAccountName: frontend-sa

We can now create an AuthorizationPolicy for our backend service, which will only allow access from the frontend service account as a principal. In Istio, principals are prefixed with the trust domain of the mesh’s certificate authority, which in Cloud Service Mesh on Google Cloud is derived from your project. With a project ID of my-project-id, your policy YAML could look like this:

apiVersion: "security.istio.io/v1beta1"
kind: "AuthorizationPolicy"
metadata:
  name: "backend-access"
spec:
  selector:
    matchLabels:
      app: backend
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["my-project-id.svc.id.goog/ns/default/sa/frontend-sa"]

When a request comes from the identity of the frontend-sa service account, it will be allowed by this policy. Requests can also be allowed from entire namespaces rather than individual principals using the namespaces source, as sketched below, and just like other conditions, we can construct rules with negative conditions such as notPrincipals and notNamespaces.
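As a sketch, a policy allowing any workload in an assumed frontend namespace to reach the backend might look like this:

apiVersion: "security.istio.io/v1beta1"
kind: "AuthorizationPolicy"
metadata:
  name: "backend-access-ns"
spec:
  selector:
    matchLabels:
      app: backend
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces: ["frontend"]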

For a full list of supported options for the source of a request, see https://istio.io/latest/docs/reference/config/security/authorization-policy/#Source

Assembling and layering policies

As we mentioned earlier, you can stack ALLOW and DENY policies however you want to, but the way they are processed can be a little counterintuitive. The most important thing to remember is that DENY policies are evaluated first. If a single DENY policy matches a request, that request will be denied before any ALLOW policies have been evaluated. In some systems, it’s customary to use something akin to a DENY ALL rule and then build up specific ALLOW rules based on desired behaviors. With AuthorizationPolicies, that won’t work as the ALLOW rules will never be seen.

The recommended approach for this pattern is instead to use an ALLOW policy that matches nothing. In the absence of other ALLOW rules, this will cause all requests to be denied (in the same spirit as a DENY ALL rule). However, we can now add additional ALLOW policies to grant the access we want, and they will still be evaluated. The “Allow nothing” rule looks like this:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-nothing
spec:
  action: ALLOW

Conversely, we might choose to set up an “Allow all” rule. This is similar, but contains a single empty rule that matches all requests:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-all
spec:
  action: ALLOW
  rules:
  - {}

With an “Allow all” rule in place, we would have to build multiple DENY policies as well to stop undesirable connections, such as the sketch shown below. These would be evaluated first and could still block traffic before any unmatched request is handled by the “Allow all” rule. As you can imagine, this approach leads to a lot of management overhead.
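For illustration, a DENY policy blocking a single undesirable path might look like this minimal sketch (the app=products selector and the /admin path are assumptions):

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-admin
  namespace: default
spec:
  selector:
    matchLabels:
      app: products
  action: DENY
  rules:
  - to:
    - operation:
        paths: ["/admin/*"]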

For this reason, the “Allow nothing” approach is definitely recommended. I just wanted to show you both approaches to help you understand the concepts involved.

Testing policies before enforcing them

Cloud Service Mesh also supports an Istio annotation that applies your policies in a “dry-run” mode. Requests that match a dry-run policy are always allowed, but the enforcement decision that would have been made is written to Cloud Logging. You can enable “dry-run” mode on any AuthorizationPolicy by adding the following annotation to the object’s metadata:

metadata:
  annotations:
    "istio.io/dry-run": "true"

This feature can be useful if you need to test policies on live production traffic without actually enforcing them straight away.
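For example, here’s how the earlier “allow nothing” policy might look with dry-run enabled, letting you see in the logs which requests it would have denied:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-nothing
  annotations:
    "istio.io/dry-run": "true"
spec:
  action: ALLOW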

Using policies to define what network connections count as legal and valid is a good way to document your system and enforce an important security principle. If you reduce the permitted connections between proxies to only the traffic that you know is required for your application stack, you are reducing the potential attack surface that could be exploited by a bad actor who has gained access to any component of your system. Policies can also protect components from services that are incorrectly configured or have been recently changed without proper testing.

Summary

We’ve now completed our journey through Cloud Service Mesh, a powerful and fully managed implementation of the popular Istio stack that is a key offering in GKE Enterprise. Hopefully the concepts of Service Mesh have now been demystified for you, and you’ll be able to determine if using Service Mesh is the right choice for your environments and workloads in the future.

Some argue against Service Mesh due to the additional complexity it can introduce. It’s true that using it effectively will require you to think about additional layers of configuration for your workloads. However, with Cloud Service Mesh it’s never been easier to get started and leverage the benefits of Istio. Ultimately, the level of increased complexity you want to add to your environment may depend on the original complexity of your application stack. Environments that may not warrant a mesh include single stateless services, or long-running applications with consistent levels of demand. Such environments are rare these days, however, and if you’re already considering GKE Enterprise, chances are you have a complex microservices stack to deploy. Despite the extra work, a mesh will reward you by making traffic management, observability and security easier in the long run.

While we’ll be moving away from discussing Service Mesh, in my next post I’ll continue with the theme of security. Traffic control is just one part of our security toolbelt in Kubernetes, so stay tuned for more guidance on workload identity, binary authorization, network policies and how we integrate Google Cloud’s native security tools into our GKE workloads.

Cover image by Tung Lam from Pixabay
