Project Aether: Building an AI-Native IaC Tool From a Secure Foundation

Sahil Gada

This series documents the journey of building "Aether," an offline-first, AI-native CLI tool that generates secure Infrastructure as Code (IaC) from natural language prompts. This week, we moved past basic setup and into the core challenge: creating a high-quality, secure dataset to train our AI model.

Phase 1 – The AI Dataset (Progress & Challenges)

Objective

The goal is to create a dataset of several hundred high-quality, "instruction-response" pairs. Each pair consists of a natural language prompt (e.g., "Create a secure S3 bucket") and a "perfect" CloudFormation YAML response. The quality of our final AI model depends entirely on the quality of this dataset.
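For illustration, a single training pair might look like the JSONL line below. The field names ("instruction", "response") are our working convention for this project, not a fixed standard.

JSON

{"instruction": "Create a secure S3 bucket for storing application logs.", "response": "AWSTemplateFormatVersion: '2010-09-09'\nResources:\n  LoggingBucket:\n    Type: AWS::S3::Bucket\n    Properties: ..."}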

Our core challenge was ensuring every YAML example is not just functional, but also secure and compliant with best practices.


Problem 1: Ensuring Security at Scale

Manually verifying the security of hundreds of YAML files is impractical and error-prone. We needed an automated way to enforce security best practices for every single example in our dataset.

Solution:

We integrated checkov, an open-source static analysis tool for IaC. The new rule for our workflow became: No YAML response is considered "perfect" until it passes the checkov scan with zero failures.
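In practice, this gate is easy to enforce because checkov's exit status reflects the scan result: it exits non-zero when any check fails. A minimal sketch (candidate.yml is a placeholder filename):

Bash

# checkov exits 0 only when every check passes, so shell logic can gate on it.
checkov -f candidate.yml && echo "ACCEPTED" || echo "REJECTED"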


Problem 2: The Iterative Validation Cycle

Our first test was creating a secure S3 bucket for logging. This simple task immediately highlighted the importance of automated validation.

Initial Attempt:

A basic YAML file for an S3 bucket.
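Reconstructed as a sketch (the exact original properties may have differed), it was little more than a private bucket:

YAML

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LoggingBucket:
    Type: AWS::S3::Bucket
    Properties:
      AccessControl: Private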

Error Message:

Bash

checkov -f test.yml

Check: CKV_AWS_18: "Ensure the S3 bucket has access logging enabled"
FAILED for resource: AWS::S3::Bucket.LoggingBucket

Cause:

The bucket, while private, was missing server access logging, a critical security feature for auditing.

Fix #1:

We modified the template to include a second bucket (LogDestinationBucket) and enabled logging from the primary bucket to the new one.
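The shape of the change is sketched below: a LoggingConfiguration on the primary bucket pointing at the new destination (the LogFilePrefix value is an arbitrary choice):

YAML

Resources:
  LoggingBucket:
    Type: AWS::S3::Bucket
    Properties:
      # Ship server access logs to the dedicated destination bucket.
      LoggingConfiguration:
        DestinationBucketName: !Ref LogDestinationBucket
        LogFilePrefix: access-logs/
  LogDestinationBucket:
    Type: AWS::S3::Bucket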

New Error:

Bash

checkov -f test.yml

Check: CKV_AWS_21: "Ensure the S3 bucket has versioning enabled"
FAILED for resource: AWS::S3::Bucket.LogDestinationBucket

Cause:

We had secured our primary bucket, but the new logging bucket we created was itself insecure. It was missing versioning, encryption, and public access blocks. This is a classic IaC pitfall.
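The remedy was to give the logging bucket the same hardening as the primary one. In CloudFormation terms, that means properties along these lines (a sketch; the property names are standard, the values are typical choices):

YAML

LogDestinationBucket:
  Type: AWS::S3::Bucket
  Properties:
    # Versioning satisfies CKV_AWS_21.
    VersioningConfiguration:
      Status: Enabled
    # Default encryption at rest.
    BucketEncryption:
      ServerSideEncryptionConfiguration:
        - ServerSideEncryptionByDefault:
            SSEAlgorithm: aws:kms
    # Block every form of public access.
    PublicAccessBlockConfiguration:
      BlockPublicAcls: true
      BlockPublicPolicy: true
      IgnorePublicAcls: true
      RestrictPublicBuckets: true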


Problem 3: The "Who Logs the Logger?" Dilemma

After securing the new logging bucket with all the same best practices, we faced one final, stubborn error.

Error Message:

Bash

checkov -f test.yml

Check: CKV_AWS_18: "Ensure the S3 bucket has access logging enabled"
FAILED for resource: AWS::S3::Bucket.LogDestinationBucket

Why it happened:

checkov correctly pointed out that our log destination bucket also needed logging enabled. While technically true, this would require a third bucket to store the logs for the log bucket, leading to unnecessary complexity for our dataset.

Fix (The "Perfect" Response):

This is a valid exception to the rule. We suppressed the check for this specific resource by adding a Metadata block to the CloudFormation template. This tells checkov that we have intentionally reviewed and accepted this exception.

YAML

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LogDestinationBucket:
    Type: AWS::S3::Bucket
    Metadata:  # <-- The final fix to suppress the check
      checkov:
        skip:
          - id: 'CKV_AWS_18'
            comment: 'This is the destination bucket for logs'
    DeletionPolicy: Retain
    Properties:
      # ... other security properties

This iterative process resulted in a truly secure and validated template, ready for our dataset.


Automating the Workflow

Manually running checkov on every file is too slow. We wrote a simple Python script to automate the validation of the entire dataset file at once.

The Script (validate_dataset.py):

Python

import json
import subprocess
import tempfile
import os
from rich.console import Console

console = Console()
DATASET_FILE = "data.jsonl" # Updated filename

def validate_dataset():
    """
    Reads a JSONL dataset file, validates the YAML output of each entry 
    using checkov, and reports the results.
    """
    if not os.path.exists(DATASET_FILE):
        console.print(f"[bold red]Error:[/bold red] Dataset file '{DATASET_FILE}' not found.")
        return

    # ... (rest of the script) ...

This script loops through each entry, saves the YAML to a temporary file, runs checkov, and prints a simple "PASSED" or "FAILED" report, showing the errors for any failures. This turns hours of work into seconds.
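For readers who want the elided loop, here is a minimal sketch under two assumptions: the dataset is JSONL, and each entry stores its CloudFormation template under a "response" key (our naming convention, shown earlier). Our actual script differs in reporting details.

Python

# Sketch of the elided validation loop (assumes a "response" field per entry).
with open(DATASET_FILE) as f:
    entries = [json.loads(line) for line in f if line.strip()]

for i, entry in enumerate(entries, start=1):
    # Write the YAML to a temporary file so checkov can scan it.
    with tempfile.NamedTemporaryFile("w", suffix=".yml", delete=False) as tmp:
        tmp.write(entry["response"])
        tmp_path = tmp.name
    try:
        # checkov exits 0 only when every check passes.
        result = subprocess.run(
            ["checkov", "-f", tmp_path, "--compact"],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            console.print(f"[green]PASSED[/green] entry {i}")
        else:
            console.print(f"[bold red]FAILED[/bold red] entry {i}")
            console.print(result.stdout)
    finally:
        os.unlink(tmp_path)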


Tools Used

  • Python 3.10

  • Checkov (for automated security scanning)

  • CloudFormation (YAML)

  • Visual Studio Code / nano

What's Working So Far

  • A clear project architecture for a self-contained, AI-native CLI tool.

  • A robust, automated workflow for creating and validating secure IaC dataset examples.

  • A seed dataset with several high-quality, validated entries.

Want to Follow Along?

I’ll be sharing weekly progress — issues, logs, architecture, and the AI model itself. If you've solved similar problems (like automated cloud optimization or building AI developer tools), I’d love to hear your insights.
