context engineering

What we see in this diagram is a context engineering architecture that goes far beyond simply writing a prompt and sending it to the model.

System Prompt

This serves as the backbone of the entire architecture, establishing the fundamental guidelines that govern the behavior of the LLM throughout the interaction. More than just initial instructions, it acts as a behavioral contract that defines not only what the model should do, but also how it should do it. Here, we define the assistant’s personality, specifying whether it should be formal or casual, technical or accessible, creative or factual. This initial configuration is crucial because it influences all subsequent responses, ensuring consistency and predictability in the system’s behavior.

User Prompt

This represents the user’s direct input, but its effectiveness goes beyond simply asking a question. It acts as a bridge between human intent and machine understanding, and its quality depends greatly on how the other components of the architecture have been structured and configured.

Short-term Memory (History)

This maintains the context of the current conversation, allowing the LLM to preserve coherence and continuity across sequential interactions. It is essential for creating natural conversational experiences, where the model can reference earlier information, follow complex discussion threads, and adapt responses based on previous exchanges. Effective use of this memory requires sophisticated strategies for managing the context window, where older information may be summarized or discarded to make room for new input, while keeping the most relevant elements for the ongoing conversation.

Long-term Memory

This represents a significant advancement in LLM-based systems, enabling them to store persistent information that remains relevant across multiple sessions or contexts. It transforms isolated interactions into continuous, personalized experiences by allowing the system to recall user preferences, past interactions, and accumulated knowledge over time. Implementing long-term memory involves considerable technical challenges, such as selecting relevant information to retain, organizing it for efficient retrieval, and ensuring the privacy and security of stored data.

RAG (Retrieval-Augmented Generation)

This is one of the most sophisticated components of the architecture. There are more than 20 RAG variations, ranging from basic forms to advanced implementations such as GraphRAG, Self-RAG, Corrective RAG (CRAG), and Adaptive RAG. Each variation addresses specific challenges, including improving the relevance of retrieved documents, enhancing the quality of generated responses, reducing hallucinations (fabricated information), and dynamically adapting to the type and complexity of a query. Choosing the appropriate RAG implementation depends on application-specific requirements, the type of knowledge being accessed, and the trade-offs between accuracy, speed, and computational cost.

Tools

Tools provide the essential capability to extend the LLM beyond text generation, transforming it into an intelligent agent that can perform real-world, measurable actions. This component allows the model to execute specific tasks such as complex calculations, real-time web searches, image generation, data visualization, integration with external APIs, database operations, and automation of routine tasks.

Effective implementation of tools requires a robust function-calling framework. The model must be able to recognize when a tool is needed, choose the correct one, prepare the appropriate parameters, and incorporate the results back into the conversation naturally and contextually.

Structured Output

Structured output ensures that system responses follow predefined formats when required, such as JSON, XML, or any other structure that downstream systems can process automatically. This is critical for integration into production pipelines, where the LLM’s output must be usable by other systems, applications, or automated processes.

Implementing structured output involves more than formatting. It includes schema validation, data consistency checks, and handling edge cases where the LLM might struggle to adhere precisely to the expected structure.

Guardrails

Guardrails act as the critical safety layer of the architecture, providing proactive and reactive checks against a wide range of security and compliance risks. This layer defends against prompt injection, where malicious users attempt to manipulate the system through crafted prompts. It also prevents jailbreaking, which refers to attempts to bypass the model's restrictions and limitations.

Guardrails help prevent the leakage of sensitive information, ensuring that confidential or proprietary data is not inadvertently revealed in responses. They also filter and block inappropriate, offensive, or potentially harmful content, keeping interactions within ethical and legal boundaries.

Effective guardrails require a multi-layered approach. This includes input filtering to detect harmful prompts, real-time monitoring during generation, and output validation to ensure compliance with established policies. Despite being overlooked in some implementations, this layer is essential for production environments, particularly in corporate or regulated settings.

This holistic approach to Context Engineering represents the natural evolution of working with LLMs, where each component works in synergy to create more robust, secure and effective experiences. It's a reminder that the real power of LLMs is not just in the model itself, but in how we orchestrate all these elements to create complete and reliable solutions.

Context Engineering: what really improves the performance of LLMs