AWSsence: Exploring Event Monitoring

Today, I want to take a simple idea—monitoring an event in AWS CloudTrail—and explore several ways to implement a system that would alert interested parties. The purpose is to share various methods to set up monitoring. Some are simple, others are more elaborate. I will compare these solutions based on a few different criteria. You will see familiar services and possibly new ones you have not used. Each solution is accompanied by a Terraform script to stand it up for further exploration.

I approached this simple task with the idea of shoshin, a Japanese term describing a deliberate lack of the preconceptions that experts often bring. The idea is to approach the subject openly, or with “beginner’s mind”. In the words of Shunryu Suzuki, “In the beginner’s mind there are many possibilities, but in the expert’s there are few.” The spirit of this article is not to provide definitive and “correct” answers but to open up possibilities.

CloudTrail records events in AWS with the idea of auditability. Who deleted the database? Who changed the WAF rules? How was a Lambda invoked? Unfortunately, CloudTrail doesn’t tell us why! It is up to the organization to investigate and hopefully find the cause with a spirit of psychological safety. CloudTrail must be combined with other services to automate alerting when an event does happen. This article explores some solutions. The accompanying Terraform scripts do not produce complete, production-ready solutions but will give you ideas.

In this article, I work with the notion of a Security Group in AWS. These can be attached to network interfaces to provide stateful traffic protection. Security Group settings are important security details. How will we know if somebody changes a rule within the Security Group? CloudTrail will capture the event. How can we broadcast it? Most of the ideas in this article involve CloudTrail combined with other services. I even put forth one solution that does not involve CloudTrail at all. First, I will start with the simplest solutions and move towards the more complex. The Terraform scripts accompanying each example will set up the solution. All you need to do is make some Security Group changes! Throughout the article, I link to examples of messages and templates.

SIMPLE EVENTBRIDGE (Terraform)

The above diagram shows perhaps the simplest solution possible for alerting. I show several diagrams in this article. In each diagram, the person at the top left changes a Security Group rule. It doesn’t matter how they do it. In most diagrams, the flow ends at the “Security Center”, the team concerned with maintaining Security Group settings.

Most solutions include EventBridge. It allows us to react to events, filter, transform, and send the result to a target. In this article, we will see EventBridge doing all of this.

In this first solution, I have set up an EventBridge rule to listen for a specific event in CloudTrail, “ModifySecurityGroupRules”. The rule simply passes the corresponding CloudTrail log to SNS. From there, SNS emails the Security Center. A sample of an email message shows us what CloudTrail logged including who made the change, when, and the change itself.
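To make the rule concrete, here is a minimal sketch of an event pattern that selects this CloudTrail event, with a deliberately simplified matcher to show how the filtering behaves. The `matches` function and `sample_event` are my own illustrations (the real matching happens inside EventBridge and supports many more operators), and the pattern in the accompanying Terraform script may differ.

```python
import json

# Event pattern for the rule: select the CloudTrail record of one EC2 API call.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["ModifySecurityGroupRules"],
    },
}


def matches(pattern, event):
    """Simplified stand-in for EventBridge matching (exact values only).

    A list in the pattern means "the event value must equal one of these";
    a nested dict means "descend into the same key of the event".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical, heavily trimmed event used only for illustration.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "ModifySecurityGroupRules",
    },
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

Any other EC2 API call, or an event missing the `detail` section, falls through the pattern and never reaches SNS.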

Sounds perfect! But some things are potentially missing. First, we don’t see the original state of the Security Group rule. Hopefully, that was logged somewhere. Next, we only see individual occurrences of the event, not aggregations. A Security Group rule change is admittedly not the best example here; an event that happens regularly (maybe a Lambda invocation?) would be more useful to see in aggregate. Finally, this solution does not include a remediating action. None of the solutions in this article goes so far as to remediate the change automatically. Nevertheless, I envision the EventBridge rule also targeting a Systems Manager Automation. Are there other possibilities?

In summary, this simple solution is not perfect. As you will see, none of the solutions in this article are perfect. But they all have strengths and could be combined. The strength of this solution is its simplicity. It got the job done and is also fast. The Security Center knows somebody/something changed the Security Group rule.

SIMPLE CONFIG (Terraform)

AWS Config monitors your infrastructure for compliance with standards and optionally remediates problems. In this example, I utilize Config’s ability to record infrastructure state and alert when a Security Group’s rules change. The architecture is below.

This looks like the first example with CloudTrail swapped out for Config. Config does not need CloudTrail to record infrastructure changes. EventBridge listens for a configuration change and passes Config’s message to SNS for email delivery to the Security Center.

How is this different from the first solution (Simple EventBridge)? Look at a sample email message delivered to the Security Center. The message contains the state of the Security Group rule both before and after the change. In the example, the CIDR on an inbound rule changed from 10.0.3.0/24 to 10.0.5.0/24. AWS Config records the history of resources to provide this information. Optionally, we could specify a Systems Manager Automation to remediate the Security Group rule.
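As a rough sketch of how the before and after states could be pulled out of such a message, the function below walks the `configurationItemDiff.changedProperties` section of a Config change notification. The sample event is hypothetical and heavily trimmed: the property key and the shape of the values are assumptions for illustration, so check a real message before relying on them.

```python
def summarize_changes(config_event):
    """Return (property, change type, before, after) tuples from a Config change event."""
    diff = config_event.get("detail", {}).get("configurationItemDiff", {})
    changed = diff.get("changedProperties", {})
    return [
        (
            name,
            change.get("changeType"),
            change.get("previousValue"),
            change.get("updatedValue"),
        )
        for name, change in sorted(changed.items())
    ]


# Hypothetical, trimmed event; the property key and value shapes are assumptions.
sample_event = {
    "source": "aws.config",
    "detail-type": "Configuration Item Change Notification",
    "detail": {
        "configurationItemDiff": {
            "changeType": "UPDATE",
            "changedProperties": {
                "Configuration.IpPermissions.0": {
                    "changeType": "UPDATE",
                    "previousValue": {"ipRanges": ["10.0.3.0/24"]},
                    "updatedValue": {"ipRanges": ["10.0.5.0/24"]},
                }
            },
        }
    },
}

for name, change_type, before, after in summarize_changes(sample_event):
    print(f"{name} [{change_type}]: {before} -> {after}")
```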

However, there are a couple of drawbacks compared to the first solution. First, the notification can take up to a minute to reach the Security Center, compared to a few seconds for the first solution. Second, AWS Config incurs a monetary cost that depends on how you use the service. I am not using a Config rule, which helps keep the cost down. If your organization already uses AWS Config, additional costs would be negligible.

Note that at this time, you cannot tell Config to record only a specific Security Group (like the one created by Terraform); it records all of them. I use the EventBridge rule to filter events down to our specific Security Group. Therefore, only changes to Terraform’s Security Group are reported to the Security Center.
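One way that filter could look is sketched below as a small function that builds the event pattern for a given Security Group. The function name and the group ID are placeholders of mine; the Terraform script may express the filter differently.

```python
import json


def security_group_change_pattern(group_id):
    """Build an event pattern for Config change notifications on one Security Group."""
    return {
        "source": ["aws.config"],
        "detail-type": ["Configuration Item Change Notification"],
        "detail": {
            "configurationItem": {
                "resourceType": ["AWS::EC2::SecurityGroup"],
                "resourceId": [group_id],
            }
        },
    }


# "sg-0123456789abcdef0" is a made-up ID standing in for Terraform's Security Group.
print(json.dumps(security_group_change_pattern("sg-0123456789abcdef0"), indent=2))
```

Config still records every Security Group; the pattern only narrows what EventBridge forwards to SNS.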

The next two solutions return to CloudTrail and its ability to deliver logs to CloudWatch.

METRIC FILTER (Terraform)

CloudTrail can deliver logs to S3 or CloudWatch. The cost of storing logs in CloudWatch compares unfavorably to S3 (depending on retention). So the next two solutions would be more appropriate if your organization already uses CloudWatch to store CloudTrail logs. Let’s look at the first CloudWatch architecture.

After log delivery to CloudWatch, this example uses a metric filter to inspect logs for our event (ModifySecurityGroupRules) to trigger an alarm. The alarm in turn sends a message to SNS. SNS sends an email to the Security Center which can look at the logs for details.
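The flow above can be sketched locally as follows. `FILTER_PATTERN` uses the CloudWatch metric filter syntax for JSON logs; the two functions only imitate what the metric filter and the alarm do, and the threshold of one occurrence per evaluation period is my assumption rather than a value taken from the Terraform script.

```python
import json

# Metric filter pattern as it would be configured on the log group.
FILTER_PATTERN = '{ $.eventName = "ModifySecurityGroupRules" }'


def count_matches(log_lines, event_name="ModifySecurityGroupRules"):
    """Imitate the metric filter: count JSON log lines whose eventName matches."""
    count = 0
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON lines never match a JSON filter pattern
        if isinstance(record, dict) and record.get("eventName") == event_name:
            count += 1
    return count


def alarm_state(metric_value, threshold=1):
    """Imitate the alarm: fire when the metric reaches the (assumed) threshold."""
    return "ALARM" if metric_value >= threshold else "OK"


# Hypothetical log lines from one evaluation period.
period_logs = [
    '{"eventName": "DescribeInstances"}',
    '{"eventName": "ModifySecurityGroupRules"}',
    "not json at all",
]

print(alarm_state(count_matches(period_logs)))
```

Note that only a number crosses from the filter to the alarm, which is why the resulting email describes the alarm rather than the change.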

Note the sample email message. The information focuses less on the CloudTrail log and more on the alarm itself. It does not give the Security Center the details of the change up front. This may or may not matter, as they need to investigate regardless. In any event, this is certainly a step down from our previous solutions.

Moreover, it can take a few minutes after the event for the email to be delivered to the Security Center. The slow delivery of CloudTrail logs to CloudWatch is not ideal for situations when time is critical. Is CloudTrail log delivery to S3 a bit quicker? We will see later.

Finally, using metric filters is more appropriate for detecting events in aggregate. Again, the example of using Security Group rule changes may not be the best use case.

These drawbacks do not make this solution my favorite for this situation. But remember my intent of ‘beginner’s mind’. We can put this solution in our back pocket for a better fit in the future. Let’s take a look at the second CloudWatch solution.

LOGS INSIGHTS (Terraform)

I experimented with CloudWatch Logs Insights to see whether it had any special characteristics I could take advantage of. Let’s see how it worked.

The idea behind this solution is to use CloudWatch Logs Insights to extract information from the CloudTrail logs. With Logs Insights, I could query only the information I needed. I realized this would be a ‘pull-based’ solution: something has to execute the Logs Insights query regularly to fulfill any need for time-sensitive updates. I created a Lambda function to initiate the query and retrieve the results. What did I find?

This solution is incomplete as I am not delivering a message to a Security Center. I needed more automation, which makes this solution a bit more brittle and more expensive. The Lambda, written in Python, sends a query to Logs Insights. The response from the query is technically JSON but not optimally designed for further processing. Moreover, I wanted the query to return whole JSON objects, not just individual field values. This would make the results more adaptable to specific situations. Logs Insights appears not to provide this possibility (please let me know if I’m wrong). In conclusion, Logs Insights is more appropriate for ad hoc use.
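To illustrate the reshaping the Lambda ends up doing, here is a minimal sketch. The query results come back as a list of rows, each row being a list of field/value pairs, so a small helper can turn every row into a plain dictionary. The query text, the helper name, and the sample response are my own assumptions, not the code from the repository.

```python
# A query of the kind the Lambda might submit (assumed, not the repository's exact text).
QUERY = """
fields @timestamp, eventName, userIdentity.arn
| filter eventName = "ModifySecurityGroupRules"
| sort @timestamp desc
| limit 20
"""


def rows_to_dicts(results):
    """Turn rows of {"field": ..., "value": ...} pairs into one dict per row.

    The internal "@ptr" field is a record pointer, not log data, so it is dropped.
    """
    return [
        {cell["field"]: cell["value"] for cell in row if cell["field"] != "@ptr"}
        for row in results
    ]


# Hypothetical "results" section of a query response, trimmed for illustration.
sample_results = [
    [
        {"field": "@timestamp", "value": "2023-01-01 12:00:00.000"},
        {"field": "eventName", "value": "ModifySecurityGroupRules"},
        {"field": "userIdentity.arn", "value": "arn:aws:iam::111122223333:user/example"},
        {"field": "@ptr", "value": "opaque-pointer"},
    ]
]

for record in rows_to_dicts(sample_results):
    print(record)
```

Even after this step every value is still a string, so nested structures such as the request parameters need further parsing.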

This solution initially showed potential but suffers from issues. Let’s move away from CloudWatch.

ATHENA AND QUICKSIGHT (Terraform)

This solution introduces the possibility of visual reporting as an alternative (or complement) to email for the Security Center. It takes advantage of Amazon QuickSight and its native integration with Amazon Athena.

The Terraform script sets up this solution. However, you need to sign up for QuickSight, which is not free after the initial trial month. The README in the GitHub Terraform repository contains more details on the cost and additional setup.

The architecture results in a QuickSight table on a dashboard that can be refreshed on a schedule. The dashboard is shown below.

I chose to display a few fields: the time of the event, the name of the event, the IAM user who made the change, and details of the Security Group state. I instruct QuickSight to filter for our familiar event, ModifySecurityGroupRules. Amazon Athena extracts the information from the CloudTrail logs delivered to S3. QuickSight uses Athena as a data source. When the dashboard is refreshed, Athena queries S3. We can do more with QuickSight than create tabular data. I will leave that to others and their imaginations.
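For a sense of what Athena does on each refresh, here is a sketch of a query over a CloudTrail table, built by a small Python helper. The table name `cloudtrail_logs` is hypothetical, the column names follow the commonly used CloudTrail table definition for Athena, and in my setup the event filter lives in QuickSight rather than in the SQL, so treat this as an illustration only.

```python
import re


def build_query(table, event_name="ModifySecurityGroupRules", limit=50):
    """Build an Athena SQL query listing recent occurrences of one CloudTrail event."""
    # Identifiers cannot be passed as query parameters, so validate them strictly.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not re.fullmatch(r"[A-Za-z0-9]+", event_name):
        raise ValueError(f"unexpected event name: {event_name!r}")
    return (
        "SELECT eventtime, eventname, useridentity.arn AS user_arn, requestparameters\n"
        f"FROM {table}\n"
        f"WHERE eventname = '{event_name}'\n"
        "ORDER BY eventtime DESC\n"
        f"LIMIT {int(limit)}"
    )


# "cloudtrail_logs" is a placeholder table name.
print(build_query("cloudtrail_logs"))
```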

How does this compare to other solutions? CloudTrail takes a few minutes to deliver logs to S3, making it the slow link in the data flow. As with the metric filter solution, this solution alone may only be appropriate for aggregate information. Finally, QuickSight incurs an additional monetary cost. Nevertheless, it could complement a quicker solution (Simple Config?). If your organization already uses QuickSight, the extra cost is negligible. The solution is highly customizable.

SYSTEMS MANAGER OPSCENTER (Terraform)

OpsCenter in Systems Manager is a newer AWS service built for an organization’s Operations team to monitor and remediate specific infrastructure issues. OpsCenter ingests OpsItems created by various means. This solution creates an OpsItem and sends an email to the Security Center.

Operations personnel can view OpsItems on the console and generate reports. In addition, they can set up automated remediations. OpsCenter will even recommend remediations. I didn’t go as far as to trigger a remediation.

This solution uses EventBridge to generate the OpsItem and deliver a message to SNS. The first EventBridge rule shows off some of the service’s power by transforming a CloudTrail log into an OpsItem. The files sample_eventbridge_event_message.json, sample_eventbridge_input_path.json, and sample_eventbridge_input_path.json in the Terraform GitHub repository drive the transformation. The first file is the CloudTrail log, the second is a mapping between fields in the CloudTrail log and the placeholders in the third file, which defines the structure of the OpsItem. EventBridge delivers the OpsItem to OpsCenter.
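The transformation can be imitated in a few lines of Python to show the idea: a map of JSON paths pulls values out of the event, and a template with `<placeholder>` markers receives them. The paths, the template fields, and the sample event below are hypothetical stand-ins rather than the contents of the repository files, and the real EventBridge input transformer has additional quoting rules this sketch ignores.

```python
import json
import re

# Hypothetical input paths map: placeholder name -> JSON path into the event.
INPUT_PATHS = {
    "event_name": "$.detail.eventName",
    "user_arn": "$.detail.userIdentity.arn",
    "event_time": "$.detail.eventTime",
}

# Hypothetical template; the field names are illustrative, not a verified OpsItem schema.
INPUT_TEMPLATE = (
    '{"title": "<event_name> detected", '
    '"description": "<user_arn> made the change at <event_time>"}'
)


def resolve(path, event):
    """Follow a simple dotted path such as "$.detail.eventName" through nested dicts."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path!r}")
    value = event
    for key in path[2:].split("."):
        value = value[key]
    return value


def transform(input_paths, template, event):
    """Replace each <placeholder> in the template with the value found at its path."""
    values = {name: resolve(path, event) for name, path in input_paths.items()}
    return re.sub(r"<(\w+)>", lambda m: str(values[m.group(1)]), template)


# Hypothetical, trimmed CloudTrail event as delivered by EventBridge.
sample_event = {
    "detail": {
        "eventName": "ModifySecurityGroupRules",
        "eventTime": "2023-01-01T12:00:00Z",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"},
    }
}

print(json.dumps(json.loads(transform(INPUT_PATHS, INPUT_TEMPLATE, sample_event)), indent=2))
```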

The second EventBridge rule listens for an OpsItem Create event and publishes a message to SNS. SNS sends an email to the Security Center. A sample of the email shows a message geared towards the OpsItem Create event itself but is configurable to some extent to contain information from the original CloudTrail log.

The speed of event delivery to the Security Center is comparable to the Simple EventBridge solution, a few seconds. Similarly, this solution is geared towards individual occurrences of events. The monetary cost primarily consists of running OpsCenter. Again, if you already use OpsCenter the extra cost is minimal.

CONCLUSION

In this article, I expanded on a simple premise: “How do we monitor an event reported by CloudTrail?” AWS offers many possibilities and this article is by no means exhaustive. Is this an advantage or should AWS be more explicit about what you can do? I tend to favor the former. I compared the characteristics of several approaches, and the comparisons show trade-offs. We can decide for ourselves which direction(s) to choose. This comes at the expense of a steeper learning curve.

Most importantly, I wanted to share my spirit for exploration and experimentation. With its sheer breadth, AWS affords many opportunities for anybody choosing to partake and discover. Thank you for taking the journey with me!