I Attempted to Build an Agentic AI ... And It Immediately Got Stuck in a Loop.


If you’ve ever worked in an offensive security role, you know the feeling. You spend weeks simulating an adversary, meticulously finding vulnerabilities, and chaining together exploits. You write a beautiful, comprehensive 50-page report detailing every finding and hand it over to the blue team. Victory, right?
Not quite. Now the other work begins. The manual, eye-watering, soul-crushing drudgery of translating that report. You have to read every single finding, interpret the attacker's actions, and manually map each one to a standardized framework like MITRE ATT&CK®. It’s a process that can take hours, sometimes days. It’s not advanced work, and it’s not even fun work, but it is necessary, because ATT&CK is the common language these days.
So, I asked the inevitable question: "What if we could automate this?"
And so began the MITRE ATT&CK Agent project: a journey to build a team of AI agents that could read a security report and do the mapping for us. What I imagined as a straightforward application of new technology quickly spiraled into a series of spectacular failures, head-scratching bugs, and profound lessons about what it really takes to move from a cool AI demo to a production-ready system. The fun thing about this project is that most frontier models, like Claude or Gemini, are trained on ATT&CK data, so I get instant validation of whether I succeeded.
This project is just for fun; it’s completely dwarfed by the fact that you could simply ask a frontier model to “extract the TTPs from this document” and it would do so. It’s really about learning how to build similar functionality ourselves, in case we ever find ourselves in a situation where the LLM is not pretrained on the data. Ahem, offensive coding patterns, ahem.
This isn't just a success story. This is a war story. Over this three-part series, I'm going to take you through the entire journey—from initial architecture to catastrophic agent failures to the production-hardening that finally made it work. In this first dispatch, we'll lay the foundation: the architectural decisions, the data pipeline, and the operational setup that everything else depends on.
The Paradigm Shift: Thinking in Agents, Not Scripts
Before we dive into the technical weeds, it's essential to understand that we're not just building a better script; we're building in a different paradigm.
Think of a Large Language Model (LLM) like Google's Gemini as a brilliant, incredibly fast, but hopelessly naive intern. It has read more books than anyone in history, but it has zero real-world experience, no common sense, and an unfortunate tendency to make things up when it doesn't know the answer. A simple chatbot is just a conversation with this intern.
An agent, however, is that intern given a goal, a set of tools (like a calculator or web access), and a reasoning loop: Thought -> Action -> Observation -> Thought... This lets it tackle multi-step problems autonomously. An agentic crew is a full team of these specialized agents working in concert: instead of one generalist, you have a team of specialists.
When I first designed the system, I mapped the workflow to how a team of humans might operate. This resulted in a four-agent crew:
The Researcher: An agent whose only job is to ingest and structure raw data. (Later proven to be an anti-pattern; keep reading.)
The Analyst: An agent that takes that structured data and looks for patterns.
The Validator: An agent that fact-checks the analyst's work against trusted sources.
The Writer: An agent that compiles the final, validated findings into a polished report.
It seemed logical on the surface, but I quickly realized this was a fundamental mistake. The "Researcher" agent's job—parsing a document and breaking it into chunks—is a deterministic, predictable task. It doesn't require complex reasoning or decision-making. By assigning this job to an LLM-powered agent, I was falling into a common 'anti-pattern': using an expensive, slow, and sometimes unpredictable tool for a job that simple, reliable code could do better and faster. This led to unnecessary costs, slower performance, and a point of potential failure where none was needed.
Even so, this division of labor is the key to tackling a complex workflow like ours. Once you decide to build such a system, your first choice is the framework. That was our first major architectural decision, and it set the course for the entire project.
The Framework Dilemma: Why CrewAI was the Only Real Choice
I considered visual, no-code platforms like N8N, which are fantastic for certain tasks. However, the choice between a code-first framework and a no-code platform is a strategic one, best explained with an analogy:
No-Code (N8N): The Nervous System. These platforms are brilliant for connecting APIs and automating linear, predictable tasks. They are the "nervous system" of an organization, perfect for workflows like, "When a new report is uploaded to Google Drive, send a Slack notification." They are reliable, visual, and excellent for simple integration.
Code-First (CrewAI): The Brain. CrewAI, on the other hand, is built to be the "brain." It excels where no-code tools struggle: managing complex, iterative reasoning loops, maintaining per-agent state and memory, and allowing for the deep, Python-native integration that our custom security tools require.
For our core intelligence task—parsing, analyzing, and validating nuanced security concepts—I needed the fine-grained control that only a code-based framework can provide. The table below breaks down our reasoning:
| Feature | CrewAI (Code-First) | N8N (No-Code) | Our Verdict |
| --- | --- | --- | --- |
| Core Unit | The Agent (with role, goal, memory) | The Node (a single step in a flow) | CrewAI's agent-first model was a perfect match for our problem domain, which required distinct analyst "roles." |
| Flexibility | Infinite (full Python ecosystem) | Limited to pre-built nodes and custom JS | We needed to integrate with Python-native security libraries like mitreattack-python. This was trivial in CrewAI and a major hurdle in a Node.js environment. |
| State Management | Built-in memory and context passing | Requires complex, manual state management | Our agents needed to maintain context across multiple reasoning steps. CrewAI was designed for this stateful, multi-step reasoning. |
| Use Case | Complex reasoning, analysis, decision-making | API integration, linear automation | Our task was pure reasoning and analysis, not simple data transformation. |
Ultimately, trying to build our system in a no-code platform would have meant creating a complex microservice in Python anyway and just calling it from a single node. CrewAI let us build the entire intelligent system within a single, coherent, Python-native environment.
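To make the agent-first model concrete, here is a minimal sketch of how two of the roles (the Analyst and the Validator) might be declared in CrewAI. Treat it as illustrative rather than the project's actual code: the role text, task wording, and example chunk are placeholders, constructor arguments shift slightly between CrewAI versions, and you still need an LLM configured (an API key or a local model) before kickoff() will run.

```python
# Illustrative CrewAI sketch, not the project's actual code.
from crewai import Agent, Task, Crew, Process

analyst = Agent(
    role="ATT&CK Analyst",
    goal="Map attacker behaviour described in a report chunk to MITRE ATT&CK techniques",
    backstory="A threat-intel analyst who thinks in tactics, techniques, and procedures.",
    verbose=True,
)

validator = Agent(
    role="Validator",
    goal="Verify every proposed technique ID against the trusted ATT&CK knowledge base",
    backstory="A sceptical reviewer who rejects anything that cannot be traced to a source.",
    verbose=True,
)

analyse = Task(
    description="Analyse this report chunk and propose ATT&CK technique IDs: {chunk}",
    expected_output="A list of technique IDs (e.g. T1059.001), each with a one-line justification.",
    agent=analyst,
)

validate = Task(
    description="Check each proposed technique ID against the knowledge base and drop anything unverified.",
    expected_output="The validated list of technique IDs.",
    agent=validator,
)

crew = Crew(agents=[analyst, validator], tasks=[analyse, validate], process=Process.sequential)
result = crew.kickoff(inputs={"chunk": "The operator used PowerShell to dump LSASS memory."})
print(result)
```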
The Unseen 80%: Building the RAG Pipeline
The biggest lie in the AI hype cycle is that it's all about the model. The reality, as any practitioner knows, is that the vast majority of the work is unglamorous data engineering. An AI system is only as good as the data it's fed. Before our agents could analyze anything, I had to build a robust pipeline to prepare their "food."
What is RAG? Grounding the AI in Reality
The single greatest danger of any LLM-based system is hallucination. An LLM, when asked a question it doesn't know the answer to, will confidently invent a plausible-sounding answer. In a cybersecurity context, this is a catastrophic failure mode. We cannot have an agent inventing MITRE techniques or misclassifying real ones.
To solve this, I employed Retrieval-Augmented Generation (RAG). Instead of relying on the LLM's internal (and sometimes fallible) memory of the internet, we force it to consult a trusted, private knowledge base before making a decision. It's the difference between asking your intern to recall a fact from a book they read two years ago versus handing them the specific page and saying, "Tell me what this says."
This meant our first task was to build that trusted "library" for our agents. I specifically decided against using CrewAI's built-in memory=True feature for this. That feature is for conversational memory—remembering the last few turns of a conversation. It is not a permanent, searchable encyclopedia. For a knowledge base, I needed a dedicated, purpose-built RAG tool.
Step 1: Building the Knowledge Base - Ingesting the MITRE Corpus
We needed our knowledge base to be the definitive, authoritative source of the MITRE ATT&CK framework. I couldn't just scrape the website; that would be brittle, incomplete, and unprofessional. To do this properly, I went straight to the source: MITRE’s official STIX 2.1 JSON bundles, consumed via their TAXII 2.1 server.
STIX (Structured Threat Information eXpression) and TAXII (Trusted Automated eXchange of Intelligence Information) are the professional standards for sharing threat intel. Think of STIX as the universal file format (like a PDF for threats) and TAXII as the secure web server protocol (like HTTPS) used to transfer it. By building our system on these standards, we ensured our "ground truth" was always accurate, versioned, and complete. This allows our agents to reason not just about techniques, but also about the relationships between techniques, threat groups, and mitigations—a level of depth impossible with simple web scraping. We could set up a nightly job to pull updates, so our agents' knowledge base never goes stale.
I then flattened these complex, nested STIX objects into clean, coherent text blobs, ready for the next stage of the pipeline.
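As a rough illustration of that flattening step, the sketch below pulls the enterprise ATT&CK bundle from MITRE's public CTI repository (a static mirror of the same STIX 2.1 data the TAXII 2.1 server exposes) and turns each attack-pattern object into a plain-text blob. The real pipeline talks to the TAXII server and also keeps the relationship objects for groups and mitigations; this is the simplified version.

```python
# Sketch: flatten ATT&CK "attack-pattern" STIX objects into embeddable text blobs.
# Simplification: pulls the static CTI mirror instead of the TAXII 2.1 server.
import requests

BUNDLE_URL = (
    "https://raw.githubusercontent.com/mitre/cti/master/"
    "enterprise-attack/enterprise-attack.json"
)

def flatten_attack_patterns(bundle: dict) -> list[dict]:
    """Turn each technique into a flat text blob keyed by its ATT&CK ID."""
    blobs = []
    for obj in bundle.get("objects", []):
        if obj.get("type") != "attack-pattern" or obj.get("revoked") or obj.get("x_mitre_deprecated"):
            continue
        # The ATT&CK ID (e.g. T1059.001) lives in the external_references list.
        ext_id = next(
            (ref["external_id"] for ref in obj.get("external_references", [])
             if ref.get("source_name") == "mitre-attack"),
            None,
        )
        if not ext_id:
            continue
        tactics = ", ".join(p["phase_name"] for p in obj.get("kill_chain_phases", []))
        blobs.append({
            "technique_id": ext_id,
            "text": f"{ext_id} {obj['name']}\nTactics: {tactics}\n{obj.get('description', '')}",
        })
    return blobs

if __name__ == "__main__":
    bundle = requests.get(BUNDLE_URL, timeout=60).json()
    print(f"Flattened {len(flatten_attack_patterns(bundle))} techniques")
```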
Step 2: Taming the Input - The Art of Chunking & The War on Noise
With our knowledge base ready, I had to process the primary input: the security reports themselves. You can't feed an entire PDF to an LLM; you must break it down into digestible chunks. This process is far more art than science, a critical step that fundamentally determines the quality of the entire system's output.
Our approach involved several layers of refinement:
Semantic Chunking: A naive approach might be to simply split the text every 500 words. This is a terrible idea. You might split a sentence in half, separating a cause from its effect or a vulnerability from its remediation. Instead, I used a semantic chunking strategy. The process aims to split the document along logical boundaries—at the end of paragraphs, headings, or bullet points. This keeps related ideas together, preserving the context that is vital for accurate analysis. I aimed for chunks around 400 tokens, a sweet spot that's large enough for context but small enough to avoid introducing irrelevant information. A minimal sketch of this chunker follows after this list.
The War on Noise: When I first ran my parser on a real penetration test report, the results were horrifying. For a 50-page document, I got over 400 chunks. But upon inspection, more than half of them were utterly useless "noise." We're talking about headers that just said "Page 17 of 50," footers with "COMPANY CONFIDENTIAL," and entire pages dedicated to the Table of Contents or legal disclaimers. This noise even triggered loops: the LLM got extremely confused, convinced it was analyzing the same chunk over and over, when in reality the chunks were different but shared identical boilerplate text.
Building the Filter: Feeding this junk data to our expensive LLM would be the equivalent of asking a master chef to make a meal out of styrofoam peanuts and shredded paper. It would waste money, slow down the process, and, most importantly, confuse the AI. The solution was to build a crucial pre-processing step: a pattern-based noise filter. This is a simple but highly effective Python script that uses regular expressions and heuristic rules to identify and discard these junk chunks before they ever enter the AI pipeline. It looks for patterns like "Page [0-9]+ of [0-9]+", lines with only one or two words, or sections with titles like "Table of Contents." This simple, deterministic filtering step was one of the most significant optimizations I made, dramatically improving the signal-to-noise ratio of the data our agents would eventually analyze. A sketch of this filter follows below.
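Here is a minimal sketch of the paragraph-aware chunker described in the first item above. It packs whole paragraphs into roughly 400-token chunks using a crude four-characters-per-token estimate; the real pipeline uses a proper tokenizer and handles headings and bullets more carefully.

```python
# Sketch: paragraph-aware ("semantic") chunking with a ~400-token budget.
import re

MAX_TOKENS = 400
CHARS_PER_TOKEN = 4  # crude heuristic; the real pipeline counts tokens properly

def semantic_chunks(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split on paragraph boundaries and pack whole paragraphs into ~max_tokens chunks."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    budget = max_tokens * CHARS_PER_TOKEN
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        if current and current_len + len(para) > budget:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```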
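And a sketch of the pattern-based noise filter, built around the kinds of rules mentioned above. The pattern list and thresholds here are placeholders; the project's filter has more rules and is more conservative about what it throws away.

```python
# Sketch: heuristic noise filter that discards junk chunks before they reach the agents.
import re

NOISE_PATTERNS = [
    re.compile(r"^\s*page\s+\d+\s+of\s+\d+\s*$", re.IGNORECASE | re.MULTILINE),  # "Page 17 of 50"
    re.compile(r"company\s+confidential", re.IGNORECASE),                        # boilerplate footers
    re.compile(r"^\s*table\s+of\s+contents\s*$", re.IGNORECASE | re.MULTILINE),
]

def is_noise(chunk: str, min_words: int = 15) -> bool:
    """Flag chunks that are too short, match boilerplate patterns, or look like a ToC page."""
    if len(chunk.split()) < min_words:
        return True
    if any(p.search(chunk) for p in NOISE_PATTERNS):
        return True
    # Table-of-contents pages are full of dotted leader lines ("Findings ........ 12").
    dotted = sum(1 for line in chunk.splitlines() if re.search(r"\.{5,}\s*\d+\s*$", line))
    return dotted > 3

# Usage, together with the chunker sketched above:
# clean_chunks = [c for c in semantic_chunks(report_text) if not is_noise(c)]
```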
Step 3: The Embedding Decision - A Deep Dive into Self-Hosting vs. APIs
With clean chunks of text from both our knowledge base and the input report, I needed a way to compare them for semantic similarity. This is done by converting the text into embeddings—rich numerical representations, or vectors, that capture meaning.
Here, I faced another major architectural choice: use a powerful commercial API (like OpenAI's text-embedding-3-large) or self-host an open-source model? For a security application, the choice was clear. I self-hosted the BAAI/bge-small-en-v1.5 model. This decision was driven by three non-negotiable requirements and one strategic advantage:
| Decision Driver | Commercial API (e.g., OpenAI) | Self-Hosted (Our Choice) | Why It Mattered for Us |
| --- | --- | --- | --- |
| Data Privacy | Data sent to a third-party vendor. | All data remains within our environment. | Non-negotiable. Pentest reports are highly sensitive client data. Sending them outside our infrastructure was an unacceptable security risk. |
| Cost | Pay-per-call. Thousands of chunks per report would be expensive. | One-time setup cost. Inference is effectively free. | Critical for scalability. We needed to process large documents with thousands of chunks without incurring runaway operational costs. |
| Control | Limited to the vendor's model offerings and rate limits. | Full control over the model, version, and throughput. | Essential for performance. We could not be bottlenecked by an external API's rate limits during a large analysis job. |
| Customization | Black box; cannot be modified. | Can be fine-tuned on domain-specific data. | Strategic Advantage. This allows us to fine-tune the model on cybersecurity-specific text in the future for even better performance on our niche task. |
Later on, I opted for a frontier model for the processing logic anyway, so while it defeats the “privacy” aspect this time around, it does lay the foundations for using a self-hosted LLM. I simply decided against it for now because I want to be able to run demos on my laptop :).
To facilitate this, I designed the system with a "hot-swappable" embedding factory as well as "hot-swappable" LLM models. This is a design pattern that abstracts the embedding logic. If I later decide to switch to a different model (like the security-specialized Darktrace DEMIST-2 or a more powerful commercial API for less sensitive tasks), I can do so with a simple configuration change, not a system rewrite. This architectural foresight is crucial for building maintainable, long-lasting AI systems.
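Here is a minimal sketch of what that factory might look like. The class names and provider keys are illustrative, not the project's real identifiers; the local backend assumes the sentence-transformers package for loading BAAI/bge-small-en-v1.5, and the commercial backend assumes the openai client with an API key in the environment.

```python
# Sketch: a "hot-swappable" embedding factory driven by configuration.
# Class names and provider keys are illustrative, not the project's actual code.
from abc import ABC, abstractmethod

class EmbeddingBackend(ABC):
    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class LocalBGEBackend(EmbeddingBackend):
    """Self-hosted embeddings; data never leaves our environment."""
    def __init__(self, model_name: str = "BAAI/bge-small-en-v1.5"):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts, normalize_embeddings=True).tolist()

class OpenAIBackend(EmbeddingBackend):
    """Commercial API backend, reserved for non-sensitive tasks."""
    def __init__(self, model_name: str = "text-embedding-3-large"):
        from openai import OpenAI
        self.client, self.model_name = OpenAI(), model_name

    def embed(self, texts: list[str]) -> list[list[float]]:
        resp = self.client.embeddings.create(model=self.model_name, input=texts)
        return [d.embedding for d in resp.data]

def make_embedder(provider: str) -> EmbeddingBackend:
    """One config value decides the backend; swapping it is not a rewrite."""
    return {"local-bge": LocalBGEBackend, "openai": OpenAIBackend}[provider]()

embedder = make_embedder("local-bge")
vectors = embedder.embed(["The adversary used PowerShell for execution."])
```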
Step 4: The Operational Backbone - Our Dockerized Microservices Stack
The Python AI ecosystem is a notorious minefield of conflicting dependencies, CUDA drivers, and "it works on my machine" syndromes. From day one, I knew a professional solution required a reproducible and portable environment. I absolutely had to avoid a monolithic application structure. Instead, we containerized the entire stack using Docker and docker-compose, treating each component as a distinct microservice.
Our stack is simple, robust, and horizontally scalable:
A Qdrant container for our vector database. It exposes a stable HTTP endpoint for all vector search operations. This isolates our data layer completely.
A Hugging Face Text-Embeddings-Inference (TEI) container. This is a dedicated, high-performance server that does one thing and does it well: it serves the BGE embedding model via its own REST API. This decouples the act of embedding from our main application logic. (This is a WIP; right now that logic lives in our third container, but I plan to migrate it out eventually.)
Our main CrewAI application container. This holds all the agent logic and communicates with the other two services over a private Docker network.
This setup eliminates environment drift and provides a clean, professional path to deployment. It's the difference between a fragile script that only runs on one person's laptop and a reliable, scalable service ready for production.
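To show what that service boundary looks like from the application container, here is a hedged sketch of querying the other two services by their compose hostnames. It assumes the TEI split described above is in place, TEI's /embed route, and the qdrant-client package; hostnames, ports, and the collection name are placeholders rather than the project's actual configuration.

```python
# Sketch: the app container talking to TEI and Qdrant over the private Docker network.
# Hostnames, ports, and the collection name are assumptions, not the project's config.
import requests
from qdrant_client import QdrantClient

TEI_URL = "http://tei:80/embed"              # text-embeddings-inference service
qdrant = QdrantClient(host="qdrant", port=6333)

def embed(text: str) -> list[float]:
    """Ask the TEI container for a single embedding."""
    resp = requests.post(TEI_URL, json={"inputs": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()[0]

def attack_candidates(chunk: str, top_k: int = 5) -> list[tuple[str, float]]:
    """Retrieve the closest ATT&CK technique blobs for a report chunk."""
    hits = qdrant.search(
        collection_name="mitre_attack",
        query_vector=embed(chunk),
        limit=top_k,
    )
    return [(hit.payload.get("technique_id"), hit.score) for hit in hits]

print(attack_candidates("The operator dumped credentials from LSASS using procdump."))
```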
The First Spectacular Failure: An "Efficient" Idea That Wrecked Everything
With our data pipeline engineered, our knowledge base built, and our entire stack containerized, I was ready. I was feeling clever. "I've got a document with 269 chunks," I reasoned. "Making 269 separate API calls is inefficient. I'll be smart! I'll 'batch' them."
I stuffed all 269 chunks into one massive, context-free prompt and made a single API call, expecting a neatly organized list of results. What I got was garbage.
This is a classic RAG anti-pattern. LLMs work by paying "attention" to the most relevant parts of a prompt. By cramming everything into a single query, I had diluted the context to the point of uselessness. The model, faced with a sea of text, did what any overwhelmed worker would do: it found the easiest, most obvious piece of work and ignored the rest. It would spot a single keyword like "PowerShell" in one chunk and write its entire analysis on that, completely ignoring the subtle but critical details about "Kerberos ticket abuse" in the other 268 chunks.
The fix was obvious in hindsight but required a foundational shift in how we instructed our agents. I had to abandon the flawed "batch" approach for a methodical, per-chunk analysis. It was our first hard lesson: in the world of AI, the path that seems most efficient is often the one that leads directly to failure. True efficiency comes from giving the AI the clean, focused context it needs to do its job correctly, one step at a time.
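In code, the shape of the fix is nothing clever: one retrieval and one focused model call per chunk, as in the hedged sketch below. The retrieve and analyse callables stand in for the retrieval and agent calls sketched earlier in this post; they are placeholders, not the project's function names.

```python
# Sketch: per-chunk analysis instead of one giant "batched" prompt.
from typing import Callable

def map_report(
    chunks: list[str],
    retrieve: Callable[[str], list[str]],            # e.g. a top-k ATT&CK candidate lookup
    analyse: Callable[[str, list[str]], list[str]],  # one small, grounded LLM/agent call
) -> dict[str, list[str]]:
    """Each chunk gets its own retrieved context and its own model call."""
    findings: dict[str, list[str]] = {}
    for i, chunk in enumerate(chunks):
        candidates = retrieve(chunk)                 # focused, relevant context only
        for technique_id in analyse(chunk, candidates):
            findings.setdefault(technique_id, []).append(f"chunk {i}")
    return findings
```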
Conclusion: The Foundation is Laid. Now the Real Chaos Begins.
I've journeyed through the unglamorous but absolutely essential work of building the foundation for an intelligent system. I've chosen my framework, built a professional data pipeline, made strategic decisions about our AI's "brain," and containerized our stack for production.
I thought the hard part was over. I was wrong.
With the foundation laid, it was time to unleash the agents. What followed was a cascade of new, more terrifying problems. I had given the agents a brain, and they were starting to use it… occasionally to get stuck in infinite loops, hallucinate entire reports out of thin air, and lie with the unflinching confidence only a machine can possess. So convincingly, in fact, that my guardrail Validator agent believed the lying agent and started hallucinating as well. Talk about shared psychosis…
In Part 2 of this series, we’ll leave the world of data engineering and enter the messy, frustrating, and fascinating art of prompt engineering. We'll cover the brutally direct prompting techniques needed to keep agents on task, the absolute necessity of a "Validator" agent to act as our AI's conscience, and the spectacular failure that taught us to never, ever trust an AI's output without verification. Stay tuned.
Written by Jean-Francois Maes
Red Team Operator. SANS Author of SEC565: Red Team Operations and Adversary Emulation. SANS Co-Author of SEC699: Advanced Purple Team Tactics.