Trivy - Shifting Security From Right to Left and then Right Again

Moshe Nadler

Introduction

A fundamental principle that I have come to appreciate, after building CI/CD pipelines for some time, is the importance of identifying any potential issues at the earliest possible stage. This way, the team can spot and fix the issues before they leak into development or production environments. And guess what? This approach even has a name; it's known as "shift left".

Unit tests are the classic example of the "shift left" paradigm (Image 1). Your Continuous Integration (CI) system pulls the code from your favorite source control and runs a bunch of unit tests to make sure that the changes made to the code will not break the application. If any of the tests fail, the CI pipeline will break, and a notification will likely be sent so the team is aware that the new version has issues. The end goal is to prevent a potentially buggy version from being deployed to the development environments, or worse, to production.


Image 1. CI/CD Flow of a Kubernetes application based on GitHub, Jenkins, Slack and ArgoCD

But can we embed vulnerability and misconfiguration scanning as part of the CI in the same way we did with unit testing? Owing to the relatively small size of a microservice code base and the compact nature of the base images used for running the code (think Alpine Linux), the number of code dependencies and system packages is also modest.

This makes scanning both the base image and the code dependencies for vulnerabilities and misconfigurations feasible in a relatively short time. Therefore, vulnerability and misconfiguration scanning can be done as part of the CI flow. We just need the right tool for the job.

Trivy has been nicknamed "The Swiss Army Knife for Security Scanning", and for a very good reason. It can run as a CLI tool (great for our CI needs) but also as a Kubernetes Operator for continuous scanning of the cluster from the inside. The CLI part is really handy. It digs into different parts of our application and highlights any security issues we need to know about. Trivy can scan file systems (before we package everything inside an image), but it can also scan images (base images or our application image with all of our code inside). In general, Trivy can scan for,

  • OS package vulnerabilities

  • Code dependency vulnerabilities

  • Misconfigurations

  • Secrets exposed in the code base

  • Much more

Going back to the diagram in Image 1, we can enhance our CI/CD flow and add Trivy's abilities to our CI flow (Image 2). If Trivy finds any vulnerabilities, misconfigurations or other security issues, the build will fail and the CI system will prevent a potentially insecure version from being deployed to any of our environments.

Image 2. CI/CD Flow of a Kubernetes application based on GitHub, Jenkins, Slack and ArgoCD, with Trivy scanning for issues

Vulnerabilities are very dynamic and new ones are found by the hour (more on that in the Shift Right section). Trivy handles this by using a vulnerability database, which it pulls updates for every 6 hours by default.
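If the CI runs several scans back to back, it can also help to download the database once and reuse the local cache for the rest of the pipeline. A minimal sketch, assuming a recent Trivy version (the cache directory and image name below are just placeholders),

# Download the vulnerability database once and cache it for later steps
trivy image --download-db-only --cache-dir ./.trivy-cache

# Subsequent scans reuse the cache and skip the database download
trivy image --cache-dir ./.trivy-cache --skip-db-update --exit-code 1 --severity HIGH,CRITICAL my-app:latest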

Shift Left - Trivy in the CI Flow

I will not delve into the details of how to install the Trivy CLI tool. There are many CI systems and CI services, and the installation of Trivy can differ from one to the next. The official installation documentation can be found here. A number of examples of integrations with CI systems can be found here.

When to Hit the Brakes?

Trivy categorizes the severity of identified vulnerabilities and issues into several levels: UNKNOWN, LOW, MEDIUM, HIGH, CRITICAL. One of the questions I find myself asking all the time is, "When should I break the build and start searching for a fix?".

In a well-updated environment, there may be some low and medium vulnerabilities or issues, fewer high-level ones, and almost no critical ones. Alerting on the low and medium findings can create a lot of noise in the CI; this slows the CI cycle, impacts delivery, and frustrates the team. Alerting on high and critical vulnerabilities or issues should be much less noisy, without compromising too much on security (I hope).

So, usually I set Trivy to break the build (by sending exit code 1 to the runtime, --exit-code 1) when there are high and/or critical vulnerabilities or issues (--severity HIGH,CRITICAL). For example, when scanning a Dockerfile,

trivy config --exit-code 1 --severity HIGH,CRITICAL ./Dockerfile

Another decision that needs to be made: what if Trivy spots a high or critical issue, but there is no fix for it yet? Should we stop everything and wait, or just keep going? We can instruct Trivy to ignore vulnerabilities that are found but have no available fix by setting the --ignore-unfixed flag.

Personally, I am not in favor of using the --ignore-unfixed flag by default. My preferred approach is to break the build even in the absence of a known fix for a high or critical vulnerability. Then, some research and a decision need to be made - is this vulnerability a high risk for us, or can we live with it for some time until a fix is found? If the vulnerability is not a big deal for us, the --ignore-unfixed flag can be added temporarily (for a single build or for a defined period of time).
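For example, assuming we decided to tolerate a specific unfixed finding for a while, the scan can be relaxed like this (the image name and CVE ID below are placeholders),

# Ignore vulnerabilities that have no published fix yet
trivy image --exit-code 1 --severity HIGH,CRITICAL --ignore-unfixed my-app:latest

# Or ignore only specific findings by listing their IDs in a .trivyignore file
echo "CVE-2023-12345" >> .trivyignore
trivy image --exit-code 1 --severity HIGH,CRITICAL my-app:latest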

I use Trivy in three specific steps of the CI flow. The first two are executed before any code dependencies are installed or unit tests are run. The last one is executed just before the CI pushes the new application image to the image repository.

Scanning the Application Dockerfile and Kubernetes Manifests Files

The first step is to scan the Dockerfile which is used to create the application image, and the Kubernetes manifests (either Helm charts or plain manifests). Trivy should find any misconfigurations that can result in an insecure application image or risky Kubernetes objects.

To understand why we should do this, let's take a look at a very simple example. Let's assume that someone made a change to the Dockerfile, resulting in an application image executing with the root user. Our original Dockerfile had the non-privileged user "my-user" set,

USER my-user

But then, someone accidentally switched out the directive to,

USER root

Best practice dictates that the application container should execute with a non-privileged user. So the change above is a bad practice, which can lead to a lot of trouble. Scanning the Dockerfile with Trivy,

trivy config --exit-code 1 --severity HIGH,CRITICAL ./Dockerfile

will result in,

2024-01-23T12:17:19.686+0200    INFO    Misconfiguration scanning is enabled
2024-01-23T12:17:20.933+0200    INFO    Detected config files: 1

Dockerfile (dockerfile)

Tests: 19 (SUCCESSES: 18, FAILURES: 1, EXCEPTIONS: 0)
Failures: 1 (HIGH: 1, CRITICAL: 0)

HIGH: Last USER command in Dockerfile should not be 'root'
═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Running containers with 'root' user can lead to a container escape situation. It is a best practice to run containers as non-root users, which can be done by adding a 'USER' statement to the Dockerfile.

See https://avd.aquasec.com/misconfig/ds002

We can do the same with Kubernetes manifests. For example, running the following Trivy scan,

trivy config --exit-code 1 --severity HIGH,CRITICAL ./helm-charts

results in,

HIGH: Container 'my-app' of Deployment 'my-deployment' should set 'securityContext.readOnlyRootFilesystem' to true
════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
An immutable root file system prevents applications from writing to their local disk. This can limit intrusions, as attackers will not be able to tamper with the file system or write foreign executables to disk.

See https://avd.aquasec.com/misconfig/ksv014

So, if Trivy finds any issues with the Dockerfile or Kubernetes manifests, the build will break.

Scanning the Application Dependencies

The second step is to scan the application dependencies for vulnerabilities. Trivy supports a large number of programming languages. This is done by running,

trivy filesystem --scanners vuln --include-dev-deps --exit-code 1 --severity HIGH,CRITICAL .

Note the --include-dev-deps flag. By default, Trivy will not scan dev dependencies (like dependencies specified in the devDependencies section of an npm package.json file). For me, this is an issue. Although these dependencies will not be part of the final image, they will get installed and used as part of the testing steps. Having vulnerable dependencies in use on your CI system can lead to bad news and trouble. If Trivy flags any issues with any of the dependencies, the build will stop just in time.

Scanning the Final Application Image

The third step is to scan the final application image. This happens after all the unit tests have been executed and the Dockerfile has been used to package the application on top of a base image, resulting in an application image ready to be deployed.

This step will find issues and vulnerabilities in a number of potential hidden spots,

  • The base image itself can have outdated OS packages with vulnerabilities which didn't exist when the base image was constructed and scanned

  • The Dockerfile can install outdated OS packages to the base image, which will introduce OS package vulnerabilities in the final image

  • The Dockerfile is capable of installing application dependencies that are beyond the scope of the code base itself, such as executing npm or pip install commands within the Dockerfile. These external dependencies may introduce potential vulnerabilities.

  • Secrets and passwords can find their way into the final application image. This could happen in a few ways, like through the Dockerfile, getting mixed into the base image, or some other route.

The scan itself is done by running,

trivy image --exit-code 1 --severity HIGH,CRITICAL [IMAGE_NAME]

If Trivy spots any issues with the final image, the build will stop right there, preventing the problematic application image from sneaking into the image repository and ending up in one of the Kubernetes clusters.
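Putting the three steps together, here is a rough sketch of how they can be wired into a single CI script (the image name and paths are placeholders; the commands themselves are the ones shown above),

#!/usr/bin/env bash
set -euo pipefail

# Step 1: scan the Dockerfile and the Kubernetes manifests for misconfigurations
trivy config --exit-code 1 --severity HIGH,CRITICAL ./Dockerfile
trivy config --exit-code 1 --severity HIGH,CRITICAL ./helm-charts

# Step 2: scan the code dependencies, including dev dependencies
trivy filesystem --scanners vuln --include-dev-deps --exit-code 1 --severity HIGH,CRITICAL .

# ... install dependencies, run unit tests, build the application image ...

# Step 3: scan the final application image just before pushing it
trivy image --exit-code 1 --severity HIGH,CRITICAL my-app:latest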

Shift Right - Trivy Inside the Cluster

Unit tests are coupled to a code version. As long as no changes are introduced to the code, the results of the tests should remain the same. This means that when the tests pass and the code is deployed, we don't need to rerun the unit tests to assure the stability of the code over time.

Vulnerabilities are different. An application image with no vulnerabilities today may have critical ones tomorrow. Because we usually do not replace all of our running containers inside a Kubernetes cluster on a daily basis (and sometimes not even for weeks or months), the container images used may accumulate vulnerabilities over time that we are not aware of.

The same goes for configuration files. New configuration best practices are published based on issues and vulnerabilities discovered. So, a configuration file scan done today may miss some new best practices recommended tomorrow (I assume that you never make manual configuration changes, and everything goes via a CI/CD flow).

We can address these issues by scanning our image repositories and configuration files either on a daily basis or when new vulnerabilities are discovered (AWS Inspector can do it for us for images stored in AWS ECR). Subsequently, we will need to identify the Kubernetes clusters where those images and configs were deployed and replace them with updated ones.

Tracing and automating this solution can be complex, particularly if there are numerous image repositories with many images added each day, and lots of Kubernetes clusters with different image versions deployed. Additionally, this approach requires us to keep track of all third-party images used in the clusters (such as ArgoCD, Prometheus, etc.), which we do not always keep in our own image repositories.

A different approach involves installing a component inside the cluster that scans the cluster internals both on a regular basis and each time a change is made (for example, when a Pod spec is changed). This controller should expose metrics and reports that we can utilize for alerts and monitoring. And guess what? Trivy Operator does exactly this!

The Trivy Operator

I will not go into details of how to deploy the Trivy Operator to the cluster. The official "Getting Started" guide contains examples of a number of ways to install the operator.
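For reference, a minimal Helm-based installation could look something like this (the namespace and release name are just examples),

helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update
helm install trivy-operator aqua/trivy-operator --namespace trivy-system --create-namespace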

The operator manages a controller which runs a control loop. The control loop is responsible for the generation and continuous update of security report resources, stored as Kubernetes CRDs which can be accessed via the Kubernetes API. The reports are generated or updated when,

  • A new resource is deployed or updated within the cluster (new version for a Deployment, new StatefulSet, change in an RBAC role, etc.)

  • On a recurring schedule. This is done by setting a TTL for the security reports (defaults to 24 hours). When the TTL expires, the reports are deleted, which causes the controller to rerun the scans and create updated reports.

The Trivy Operator generates a number of reports, some of them are,

  • VulnerabilityReports - vulnerabilities found in the container images of a workload

  • ConfigAuditReports - configuration checks of Kubernetes resources against best practices

  • ExposedSecretReports - secrets exposed inside the container images of a workload

  • RbacAssessmentReports - assessment of the cluster's Roles and RoleBindings

  • ClusterComplianceReports - cluster-wide compliance against specifications such as the CIS Kubernetes Benchmark
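Because the reports are stored as regular Kubernetes resources, they can be inspected directly with kubectl. For example (the namespace and report name below are placeholders),

# List the vulnerability reports of workloads in all namespaces
kubectl get vulnerabilityreports -A

# Inspect the configuration audit findings of a specific workload
kubectl describe configauditreport -n my-namespace replicaset-my-app-5d8c7f9b6d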

As you can see, it's truly a powerful tool that provides great visibility into the various components of the Kubernetes cluster, and enables continuous security monitoring and auditing of the cluster.

Running the Scan Jobs

The Trivy Operator utilizes Kubernetes Jobs to execute the scan tasks. Each time it needs to run a scan task, it will create a Job Pod. The Job Pod runs a Trivy init-container, which will pull the Trivy Database and save it to an emptyDir volume shared by all the containers of the Pod.

Next, if for example the Job is tasked with scanning the containers of a newly deployed application Pod, it will pull the images of the Pod's containers (think of a Deployment Pod with an init-container, an application container, and maybe a sidecar) into the Job Pod. It will then run the Trivy CLI against each image, using the Trivy Database in the emptyDir volume. Finally, the resulting logs of each Trivy CLI run are aggregated into a security report (Image 3).

Image 3. Trivy Operator uses Kubernetes Jobs for scanning containers
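If you want to see this in action, the scan Job Pods can be watched while they come and go. Assuming the operator was installed in the trivy-system namespace (adjust if yours differs),

# Watch the scan Jobs and their Pods created by the Trivy Operator
kubectl get jobs,pods -n trivy-system --watch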

I would like to offer you two tips. The first one is related to the Kubernetes nodes that the scan jobs run on; the second is related to isolated environments and/or large concurrent scan jobs.

At first, I tried to reuse nodes that I already had workloads running on and so tried to limit the CPU and memory usage of the scan Job Pod as much as possible. This resulted in OOM (Out of Memory) issues.

To overcome this, I decided to run the scan jobs on separate nodes. As I'm using EKS and Karpenter to manage most of the cluster nodes, I created a separate NodePool dedicated only to Trivy Operator Jobs. Then, I set the scanJobNodeSelector in the Helm values file of the Trivy Operator to select only the nodes of this specific NodePool.

When no jobs are running, Karpenter keeps the size of the NodePool at zero (money saved!). Once there is a need for scan jobs, Karpenter will see the Pending Pods and start as many nodes as needed (the NodePool is based on Spot Instances only, so money saved again!). Once all jobs are done, Karpenter will again scale the NodePool to zero. This way, I can give the scan Job Pods more memory and CPU without impacting other running workloads in the cluster.
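As a sketch, assuming the NodePool stamps its nodes with a custom label such as dedicated=trivy-scan-jobs (both the label and the exact Helm value path are assumptions; check the chart's values.yaml for your version), the selector can be set when installing or upgrading the operator,

# Point the scan Job Pods at the dedicated Karpenter nodes
# (label and value path are examples; verify against your chart version)
helm upgrade --install trivy-operator aqua/trivy-operator --namespace trivy-system --set 'trivyOperator.scanJobNodeSelector.dedicated=trivy-scan-jobs'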

The second tip relates to situations where the scan jobs need to run on isolated nodes with no direct Internet access. By default, the scan jobs run in Standalone mode, where each job pulls the Trivy Database from GitHub. If the scan job cannot reach GitHub, it cannot start. Moreover, even when an Internet connection is available, launching a large number of concurrent scan jobs at the same time can lead to rate limit issues with GitHub.

To address this issue, we may employ Trivy in its ClientServer mode. In this approach, we deploy another component of the Trivy Operator called the trivy-server. The trivy-server is responsible for updating the Trivy Database on a regular basis. As such, it's the only component requiring Internet access. Furthermore, the scanning Job Pods will not scan the images themselves. Instead, they will send a reference of the image to the trivy-server, which then conducts the scan and relays the findings.
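With the Helm chart, switching modes can look roughly like this (the value names are an assumption based on recent chart versions; double-check them in the chart's values.yaml before relying on them),

# Switch the scan jobs to ClientServer mode and deploy the bundled trivy-server
# (value names may differ between chart versions)
helm upgrade --install trivy-operator aqua/trivy-operator --namespace trivy-system --set trivy.mode=ClientServer --set operator.builtInTrivyServer=true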

Monitoring and Alerting

The Trivy Operator exposes the results of its scanning in the form of Prometheus Metrics. This enables visibility into the state of cluster vulnerabilities and misconfigurations. There is a pre-made Grafana Dashboard that visualizes the metrics in a nice way (Image 4).

Image 4. A sample of the Trivy Operator Grafana Dashboard

The real power comes with the ability to set alerts based on the metrics. For example, we can set an alert that triggers each time the value of the trivy_image_exposedsecret metric is greater than zero, notifying us when there is a potential exposed secret. Also, as the Operator re-scans the cluster on a daily basis, we can set alerts to trigger when there is an increase in the number of critical vulnerabilities compared to yesterday.
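Before wiring up alerts, it can be useful to eyeball the raw metrics. A quick way to do this, assuming the operator's Service exposes the metrics endpoint in the trivy-system namespace (service name and ports are assumptions; check your install),

# Port-forward the operator's metrics endpoint and look at the secret and vulnerability metrics
kubectl port-forward -n trivy-system svc/trivy-operator 8080:80 &
curl -s localhost:8080/metrics | grep -E 'trivy_image_(exposedsecret|vulnerabilities)'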

As you can see, we can craft a very effective vulnerability and misconfiguration monitoring system based on the metrics and deploy it to a number of clusters, which will provide a very good grip on security.

Unfortunately, as of now, the Trivy Operator lacks a frontend interface that we can use to see detailed information about the vulnerabilities and misconfigurations found. However, if you are using Lens, there is a Trivy Operator extension that you can add to Lens. The extension provides nice and detailed insight into the issues found and can help investigate alerts sent based on the Prometheus metrics.

To Sum Things Up

Whether running as part of our CI/CD pipelines or as a Kubernetes Operator inside our clusters, Trivy is a great enabler in implementing DevSecOps and maintaining as secure an infrastructure as possible. Shifting left with Trivy will help prevent issues from reaching our clusters; shifting right again will help us detect and mitigate issues inside our clusters.

Have fun!
