Building Production-Ready Agentic Systems: Lessons from Real Implementation

Japkeerat Singh

Most articles about agentic systems show you the glossy marketing version. Multi-agent orchestration! Intelligent decision making! Autonomous task execution! The reality is more nuanced. After building a production agentic system that handles real ElasticSearch queries from actual users, I learned that reliable agent systems are less about AI magic and more about thoughtful software architecture with agents as components.

This tutorial examines a real implementation - an LLM ElasticSearch Agent built with Google's Agent Development Kit, ADK (yes, I am a hypocrite), that converts natural language queries into database operations - to understand what actually makes agentic systems work in production environments where failures matter.

Understanding What Agentic Systems Actually Solve

Before diving into architecture patterns, let me establish why you'd build an agentic system instead of a traditional application. Consider this user request: "Show me all failed login attempts from last week."

A traditional approach might involve a hardcoded query builder with preset filters. This works until users start asking variations like "Find security incidents from the past 7 days" or "Display authentication failures since Monday." Each variation requires code changes.

An agentic approach treats this as a multi-step reasoning problem. The system needs to understand that "failed login attempts" relates to authentication data, determine which database index contains this information, translate temporal expressions like "last week" into actual date ranges, generate the appropriate query syntax, execute it safely, and present results in human-readable format.

The key insight is that complex user requests often require multiple distinct capabilities working together. Rather than building one monolithic system that handles everything, agentic architectures decompose these requests into specialized agents that can be developed, tested, and debugged independently.

The Multi-Agent Pipeline Pattern

The most important architectural decision in my system was breaking the query process into distinct stages, each handled by a specialized agent. This isn't just good software engineering - it's essential for reliability when dealing with the unpredictable outputs that LLMs can produce.

My pipeline consists of three specialized agents working in sequence. The Index Selection Agent determines which ElasticSearch index contains the relevant data. The Query Generation Agent converts the natural language request into proper ElasticSearch DSL syntax. The Query Execution Agent runs the query safely and interprets results for the user.

class ElasticsearchPipelineAgent:
    def __init__(self):
        # Each agent specializes in one part of the problem
        self.index_selection_agent = create_index_selection_agent()
        self.query_generation_agent = create_query_generation_agent()
        self.query_execution_agent = create_query_execution_agent()

        # SequentialAgent coordinates the pipeline
        self.agent = SequentialAgent(
            name="ElasticsearchPipelineAgent",
            sub_agents=[
                self.index_selection_agent.agent,
                self.query_generation_agent.agent,
                self.query_execution_agent.agent,
            ],
        )

This design provides several practical advantages. First, each agent can be optimized for its specific task with tailored prompts and tools. The Index Selection Agent uses tools for discovering available indices and analyzing their schemas, while the Query Execution Agent focuses on safe query execution and result formatting. Second, failures can be isolated and handled appropriately at each stage. If index selection fails, you know exactly where to look and can potentially recover by prompting the user for clarification. Third, the system becomes much easier to test and debug when you can examine each stage independently.

The sequential approach also handles the state management challenge that trips up many agentic systems. Each agent receives the complete context from previous agents, building up the information needed for the final query execution.

Structured Output: The Foundation of Reliability

The biggest practical challenge in building reliable agentic systems is ensuring that agents produce consistent, parseable outputs that subsequent agents can work with reliably. This is where structured output schemas become absolutely critical.

Each agent in my pipeline defines exactly what it will output using Pydantic models. This isn't just good practice - it's essential for system reliability. Here's what the Index Selection Agent produces:

class IndexSelectionOutput(BaseModel):
    """Structured output for the Index Selection Agent."""

    selected_index: Optional[str] = Field(
        description="Name of the selected index, or null if selection failed"
    )
    index_schema: Optional[IndexSchema] = Field(
        description="Complete schema information for the selected index"
    )
    selection_metadata: SelectionMetadata = Field(
        description="Metadata about the selection process"
    )
    validation: ValidationResult = Field(
        description="Validation results for the selection"
    )

Notice how this structure handles partial failures gracefully. If index selection fails, the selected_index field is null, but the selection_metadata explains why it failed and the validation section provides specific error information. This allows the system to provide meaningful feedback to users rather than cryptic error messages.

The validation component is particularly important. Each agent validates not just its inputs, but also its outputs before passing them to the next stage. The Index Selection Agent doesn't just return an index name - it verifies that the index actually exists, that the schema was retrieved successfully, and that the selection is ready for query generation.

This validation-first approach catches problems early in the pipeline rather than letting them cascade through multiple agents before failing in confusing ways.
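The boundary check itself can be small. Here is a minimal sketch of validating one stage's raw output before the next stage sees it; `IndexSelectionResult` and `validate_selection_output` are trimmed-down stand-ins for the article's fuller models, not its actual code:

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class SelectionMetadata(BaseModel):
    reasoning: str
    confidence: str = "high"


class IndexSelectionResult(BaseModel):
    # Trimmed-down stand-in for the article's IndexSelectionOutput
    selected_index: Optional[str] = Field(default=None)
    selection_metadata: SelectionMetadata


def validate_selection_output(raw: dict) -> IndexSelectionResult:
    """Validate raw agent output at the stage boundary before passing it on."""
    try:
        result = IndexSelectionResult(**raw)
    except ValidationError as exc:
        # Malformed output fails here, at the boundary, with a precise error
        raise ValueError(f"Index selection produced invalid output: {exc}") from exc
    if result.selected_index is None:
        # Partial failure: the schema parsed, but no index was chosen
        raise ValueError(
            f"Index selection failed: {result.selection_metadata.reasoning}"
        )
    return result
```

The point is that a downstream agent never receives a half-formed payload: either it gets a validated model instance, or the pipeline stops at the boundary with an error that names the failing stage.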

Tool Integration: Connecting Agents to Real Systems

Agents become useful when they can interact with external systems through tools. However, tool integration introduces significant complexity around error handling, security, and connection management that doesn't exist in simple LLM applications.

My ElasticSearch tools demonstrate several patterns for reliable tool integration. The most important principle is defensive programming - assume everything will fail and design accordingly.

class QueryExecutionTools:
    def execute_query(self, query_data: Dict[str, Any]) -> Dict[str, Any]:
        try:
            # Extract and validate query components - always check inputs first
            if "generated_query" not in query_data:
                return {"error": "No generated query found in query data"}

            generated_query = query_data["generated_query"]
            query_dsl = generated_query.get("query_dsl")
            target_index = generated_query.get("target_index")
            if not query_dsl or not target_index:
                return {"error": "Generated query is missing query_dsl or target_index"}

            # Security validation - ensure read-only operations
            # This prevents agents from accidentally modifying data
            if not self._is_read_only_query(query_dsl):
                return {"error": "Only read-only queries are allowed"}

            # Resource validation - ensure index exists before querying
            if not self.es.indices.exists(index=target_index):
                return {"error": f"Index '{target_index}' does not exist"}

            # Execute query with structured error handling
            response = self.es.search(index=target_index, body=query_dsl)

            # Return clean, structured results for LLM analysis
            return {
                "total_hits": response["hits"]["total"]["value"],
                "documents": [
                    {
                        "id": hit.get("_id"),
                        "score": hit["_score"],
                        "source": hit["_source"],
                    }
                    for hit in response["hits"]["hits"]
                ],
                "took_ms": response["took"]
            }

        except Exception as e:
            logger.error(f"Error executing query: {str(e)}")
            return {"error": f"Failed to execute query: {str(e)}"}

The pattern here is safety first, then functionality. Every tool method validates its inputs, checks that required resources exist, ensures operations are safe to execute, and returns errors in a consistent format that agents can understand and handle appropriately.

Connection management is another critical concern. Tools that interact with external systems need robust connection handling to prevent resource leaks and handle network failures gracefully. My system uses a singleton pattern for ElasticSearch connections, ensuring all tools share a single connection pool while maintaining thread safety across the agent system.
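The singleton itself can be a few lines. This is a sketch of the pattern only - the class name and the lazily-created `client` attribute are illustrative, not the article's actual implementation:

```python
import threading


class SharedElasticsearchConnection:
    """Process-wide singleton so every tool shares one connection pool."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            with cls._lock:  # double-checked locking for thread safety
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, host: str = "http://localhost:9200"):
        if getattr(self, "_client_ready", False):
            return  # already initialized by an earlier caller
        self.host = host
        # In real code this would be something like: Elasticsearch(host)
        self.client = None
        self._client_ready = True
```

Every tool that constructs `SharedElasticsearchConnection()` gets the same instance, so there is exactly one pool to configure, monitor, and tear down.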

The security validation deserves particular attention. The _is_read_only_query method examines the query DSL to ensure it doesn't contain any write operations. This prevents the system from accidentally modifying data, which is essential when giving AI agents access to production databases.
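A minimal version of such a check can walk the DSL recursively and reject mutation-style keys. The denylist below is an assumption for illustration - the article's `_is_read_only_query` may well be stricter, for example an allowlist of known search clauses:

```python
# Keys that indicate mutation when they appear anywhere in a request body.
# This denylist is illustrative, not the article's actual rule set.
MUTATION_KEYS = {"script", "update_by_query", "delete_by_query", "reindex"}


def is_read_only_query(query_dsl) -> bool:
    """Recursively reject DSL that contains mutation-style keys."""
    if isinstance(query_dsl, dict):
        for key, value in query_dsl.items():
            if key.lower() in MUTATION_KEYS:
                return False
            if not is_read_only_query(value):
                return False
        return True
    if isinstance(query_dsl, list):
        return all(is_read_only_query(item) for item in query_dsl)
    return True  # scalars are safe
```

Because the check runs inside the tool, it holds no matter what the agent's prompt says - a misbehaving agent simply gets back a structured error.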

State Management Between Agent Interactions

Unlike stateless HTTP APIs, agentic systems need to maintain context across multiple agent interactions. The Index Selection Agent needs to communicate its findings to the Query Generation Agent, which in turn needs to pass both the index information and the generated query to the Query Execution Agent.

My system handles this through ADK's session management capabilities, with utility functions that make state sharing explicit and reliable:

def save_index_selection_data(
    context: ToolContext,
    selected_index: str,
    index_schema: Dict[str, Any],
    reasoning: str,
    confidence: str = "high"
) -> str:
    """Save index selection data to session state for the next pipeline agent."""

    selection_data = {
        "selected_index": selected_index,
        "index_schema": index_schema,
        "selection_metadata": {
            # Store the reasoning so later agents understand the decision
            "reasoning": reasoning,
            "confidence": confidence
        },
        "validation": {
            # Include validation status for error handling
            "index_exists": True,
            "schema_retrieved": True,
            "ready_for_query_generation": True
        }
    }

    # Save to session state for next agent to access
    context.state["index_selection_data"] = selection_data
    context.state["selected_index"] = selected_index

    return f"Index selection data saved. Selected: {selected_index}"

This approach makes state management explicit rather than relying on implicit context passing. Each agent can access previous results through well-defined state keys, and the validation information helps agents understand whether they have reliable input data to work with.

State management becomes especially important for error recovery. If the Query Generation Agent fails to produce a valid query, the system still has the index selection results and can potentially retry with a different approach or ask the user for clarification, rather than starting the entire pipeline from scratch.
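The reader side of that contract looks like this sketch, where `state` is a plain mapping standing in for ADK's session state and the key names mirror `save_index_selection_data` above:

```python
def load_index_selection_data(state):
    """Read the previous stage's results, checking validation flags first.

    Returns (data, error): exactly one of the two is None.
    """
    data = state.get("index_selection_data")
    if data is None:
        return None, "No index selection data in session state"
    validation = data.get("validation", {})
    if not validation.get("ready_for_query_generation", False):
        reasoning = data.get("selection_metadata", {}).get("reasoning", "unknown")
        return None, f"Index selection not validated: {reasoning}"
    return data, None
```

Checking the `ready_for_query_generation` flag before generating anything means the Query Generation Agent can surface a precise, user-facing reason for stopping instead of producing a query against a missing index.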

Observability: Making Agent Decisions Transparent

Production agentic systems require comprehensive observability. When an agent makes a poor decision or produces unexpected results, you need to understand its reasoning process to improve the system. This goes beyond traditional application logging because agent decisions involve complex reasoning that isn't visible from external behavior alone.

My system integrates with Phoenix for distributed tracing, treating each agent interaction as a span in a larger trace that shows how user queries flow through the agent pipeline:

async def process_user_query(runner, session_service, query: str, user_id: str, 
                           app_name: str, logger: logging.Logger, tracer=None):
    """Process user query with comprehensive tracing."""

    if tracer:
        with tracer.start_as_current_span("process_user_query") as span:
            # Record key attributes for debugging later
            span.set_attribute("user.id", user_id)
            span.set_attribute("user.query", query)
            span.set_attribute("app.name", app_name)

            # Process through agent pipeline with full trace visibility
            await _process_query_internal(
                runner, session_service, query, user_id, app_name, logger, span
            )
    else:
        # Tracing is optional - the pipeline still runs without a tracer
        await _process_query_internal(
            runner, session_service, query, user_id, app_name, logger, None
        )

This provides me with a complete view of how user queries flow through the agent pipeline, how long each stage takes, and where failures occur. More importantly, it captures the reasoning and intermediate results from each agent, making it possible to understand why the system made particular decisions.

I encountered a practical challenge with OpenTelemetry tracing in async environments. The async generators and streaming patterns common in agent frameworks can cause context detachment errors that crash the application. I solved this with safe tracing utilities that handle these issues gracefully:

@contextmanager
def safe_tracing_context():
    """Context manager that safely handles OpenTelemetry tracing errors."""
    try:
        yield
    except Exception as e:
        error_msg = str(e).lower()
        # Check if this is a context-related tracing error
        if any(keyword in error_msg for keyword in ["context", "token", "detach"]):
            # Silently ignore context-related errors to prevent crashes
            logger.debug(f"Suppressed OpenTelemetry context error: {e}")
        else:
            # Re-raise non-context related errors since they're real problems
            raise

This ensures that tracing problems don't crash the agent system while still providing observability when possible. The lesson here is that production systems need to be resilient to their monitoring infrastructure failing.

Error Handling: When Agents Make Poor Decisions

Reliable agentic systems need sophisticated error handling strategies that account for the probabilistic nature of LLM outputs. Not all errors are equal - some should cause the system to stop completely, while others should trigger graceful degradation or alternative approaches.

My system implements error handling at multiple layers, each with different recovery strategies. At the tool level, individual tools return structured error information rather than throwing exceptions. This allows agents to understand what went wrong and potentially try alternative approaches.

At the agent level, agents interpret tool errors and decide whether to retry, use alternative approaches, or escalate the error. The structured output format includes success indicators and error messages that make this decision-making explicit:

class QueryExecutionOutput(BaseModel):
    """Structured output for the Query Execution Agent."""

    execution_results: Optional[ExecutionResults] = Field(
        description="Raw query execution results, or null if execution failed"
    )
    success: bool = Field(
        description="Whether the query execution was successful"
    )
    error_message: Optional[str] = Field(
        description="Error message if execution failed"
    )
    natural_language_response: str = Field(
        description="Natural language response based on analyzing the results"
    )

At the pipeline level, the orchestrator can route queries to alternative handlers if the primary ElasticSearch pipeline fails. For example, if a query can't be mapped to any available index, the system can fall back to answering general questions about ElasticSearch concepts.

The key insight is that different types of errors require different recovery strategies. A syntax error in query generation might be recoverable by retrying with additional context, while a security violation should immediately terminate the request. The structured error handling approach makes these distinctions explicit and actionable.
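One way to make that distinction explicit in code is a small classifier that maps tool errors to recovery actions. The matching rules below are illustrative assumptions, not the article's exact taxonomy - production code would match on structured error codes rather than substrings:

```python
from enum import Enum


class ErrorAction(Enum):
    RETRY = "retry"        # regenerate with the error message as extra context
    FALLBACK = "fallback"  # route to an alternative handler
    ABORT = "abort"        # hard failure; stop the request immediately


def classify_error(error_message: str) -> ErrorAction:
    """Map a tool error string to a recovery strategy (illustrative rules)."""
    msg = error_message.lower()
    if "read-only" in msg or "not allowed" in msg:
        return ErrorAction.ABORT      # security violations terminate immediately
    if "syntax" in msg or "parsing_exception" in msg:
        return ErrorAction.RETRY      # recoverable: retry with added context
    if "does not exist" in msg or "index_not_found" in msg:
        return ErrorAction.FALLBACK   # no usable index; answer generically
    return ErrorAction.ABORT          # unknown errors fail safe
```

Note the final branch: anything the classifier doesn't recognize fails safe rather than retrying blindly, which keeps an agent from looping on an error it cannot fix.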

Configuration and Deployment Patterns

Building reliable agentic systems isn't just about the code - it's also about how they're configured, deployed, and maintained in production environments. My system demonstrates several patterns that have proven essential for operational reliability.

Different agents benefit from different language models based on their task complexity and cost requirements. My configuration approach allows easy experimentation without code changes:

# config.yaml - Flexible model configuration
agents:
  orchestrator: "openai/gpt-4o-mini"
  elasticsearch: "openai/gpt-4o-mini"  
  index_selection: "openai/gpt-3.5-turbo"  # Simpler task, cheaper model
  query_generation: "openai/gpt-4o-mini"   # Complex task, better model
  query_execution: "openai/gpt-4o-mini"

This configuration-driven approach allows me to optimize for cost and performance independently for each agent, and makes it easy to experiment with new models as they become available without touching the core system logic.
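Reading that config reduces to a small lookup with a safe default. This is a sketch assuming PyYAML; the `DEFAULT_MODEL` fallback is my assumption, not something the article specifies:

```python
import yaml

DEFAULT_MODEL = "openai/gpt-4o-mini"  # assumed fallback, not from the article


def model_for_agent(config_text: str, agent_name: str) -> str:
    """Resolve the model for an agent from YAML config, with a safe default."""
    config = yaml.safe_load(config_text) or {}
    return config.get("agents", {}).get(agent_name, DEFAULT_MODEL)
```

With a default in place, adding a new agent never requires a config change just to boot, and swapping one agent's model is a one-line YAML edit rather than a deployment.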

The deployment strategy acknowledges that agentic systems typically depend on multiple services. My Docker Compose setup orchestrates ElasticSearch for data storage, Kibana for data visualization, Phoenix for observability, and the agent system itself:

# docker-compose.yml - Complete service orchestration
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    # Configuration for data storage

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0  
    # Configuration for data visualization

  phoenix:
    image: arizephoenix/phoenix:latest
    # Configuration for observability

  llm-es-agent:
    build: .
    # Configuration for the agent system
    environment:
      - ES_HOST=http://elasticsearch:9200
      - PHOENIX_ENDPOINT=http://phoenix:6006

This approach ensures that the entire system can be deployed consistently across development, staging, and production environments. The environment variables handle deployment-specific configuration while keeping the core system logic environment-agnostic.

User Experience: Multiple Interfaces for Different Needs

Production agentic systems often need to serve different types of users through different interfaces. My system demonstrates this with both terminal and web interfaces sharing the same underlying agent logic. This architectural pattern has proven valuable in real deployments.

The core agent processing is abstracted into a reusable class that both interfaces can use:

class UnifiedAgentApp:
    """Unified application class supporting multiple interfaces."""

    def __init__(self):
        self.orchestrator = None
        self.runner = None
        self.session_service = None

    async def process_query(self, query: str, user_id: str) -> Dict[str, Any]:
        """Process user query through the orchestrator agent."""
        # Shared logic for both interfaces - the heavy lifting happens here

    def run_terminal_interface(self):
        """Run the terminal interface for developers."""
        # Terminal-specific UI logic - just handles input/output formatting

    def run_streamlit_interface(self): 
        """Run the web interface for business users."""
        # Web-specific UI logic - handles web-specific presentation

This separation allows the system to serve different user personas - developers who prefer command-line interfaces for debugging and scripting, and business users who prefer web interfaces for ad-hoc queries - without duplicating the complex agent orchestration logic.
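Serving both personas from one entry point can be as simple as a CLI dispatcher; the flag name and choices here are my assumptions for illustration:

```python
import argparse


def pick_interface(argv=None) -> str:
    """Choose an interface from CLI args; flag names are illustrative."""
    parser = argparse.ArgumentParser(description="LLM ElasticSearch Agent")
    parser.add_argument(
        "--interface",
        choices=["terminal", "web"],
        default="terminal",
        help="terminal for developers, web (Streamlit) for business users",
    )
    args = parser.parse_args(argv)
    # app = UnifiedAgentApp()
    # app.run_streamlit_interface() if args.interface == "web" else app.run_terminal_interface()
    return args.interface
```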

The lesson here is that the interface is often less important than the underlying system reliability. Users will adapt to different interfaces as long as the core functionality works consistently and provides useful feedback when things go wrong.

What Actually Works in Production

After building and operating this system, several patterns have proven essential for production reliability that aren't obvious from reading about agentic systems in theory.

Structured everything. Use structured outputs for all agent communications. The time I spent defining Pydantic models upfront saved enormous debugging time later when agents produced unexpected outputs. Structured inputs and outputs make the system testable and debuggable in ways that free-form text communication simply doesn't allow.

Validate at every boundary. Each agent should validate its inputs, perform its work safely, and validate its outputs before passing them to the next agent. This catches problems early rather than letting them cascade through multiple agents before failing in confusing ways.

Design for partial failures. In traditional applications, you often design for success and handle failures as exceptions. In agentic systems, partial failures are common and often recoverable. Design your error handling to distinguish between recoverable issues and hard failures, and make recovery paths explicit.

Make decisions observable. The complexity of agentic systems means you need comprehensive observability to understand system behavior. But don't let observability infrastructure crash your system - implement defensive observability that degrades gracefully when monitoring systems fail.

Security by default. When giving agents access to external systems, implement security controls at the tool level rather than relying on prompt engineering or agent instructions. Agents will eventually try to do things they shouldn't, either through user requests or unexpected reasoning patterns.

Configuration over code. Agent behavior often needs to be tuned based on operational experience. Make key parameters configurable so you can adjust system behavior without code deployments. This is especially important for model selection and cost optimization.

My One Piece of Advice

If you're building agentic systems, focus on making each component reliable independently before optimizing agent interactions. The systems that work in production are built on solid engineering fundamentals, not prompt engineering magic.
