AI Engineering Architecture & User Feedback - Study Notes

Gasym A. Valiyev
40 min read

Note: This document provides structured and elaboratively encoded study notes designed for deep conceptual understanding, long-term retention, and easy navigation. It is just my personal rewritten version of concepts I’ve studied from Chip Huyen’s book, “AI Engineering.” It is not an official summary or reproduction — rather, a learning exercise to reinforce and share what I’ve learned.


Introduction to AI Engineering Architecture

This study note shifts focus from individual techniques to their synergistic application in building successful AI products. It addresses the common challenge of selecting the right tools by adopting a gradual, iterative approach to architecture development: starting with the simplest architecture and progressively adding components to tackle emerging challenges.

A critical element emphasized is the invaluable role of user feedback, not just for product refinement but as a vital data source for continuous model improvement in AI applications. The conversational nature of AI interfaces, while simplifying feedback collection for users, complicates signal extraction for developers. This study note delves into types of conversational AI feedback and strategies for effective, user-friendly collection.

Core Principles:

  • Iterative Development: Start simple, add complexity as needed.

  • Problem-Driven Architecture: Components are added to solve specific identified challenges.

  • User Feedback is Key: Essential for product direction and model improvement data.


AI Engineering Architecture: A Phased Approach

A full-fledged AI architecture can be intricate. This section outlines a common evolutionary path observed in production environments, highlighting shared components across diverse AI applications. While the proposed architecture is general, specific applications may require deviations.

The Simplest Architecture: Direct Model Interaction

Initially, an AI application operates at its most basic level:

  • Process: A user sends a Query directly to a Model API. The model performs Generation and returns a Response to the user.

  • Components:

    • User: Initiates the interaction.

    • Model API: Represents the interface to the underlying AI model. This can be a third-party service (e.g., OpenAI, Google, Anthropic) or a self-hosted model (inference server).

    • Generation: The core function of the model, producing an output based on the input query.

  • Limitations:

    • No external data access.

    • No protective measures (guardrails).

    • No performance or cost optimizations.

Visual Representation (Figure 1):

Active Recall Question: What are the key limitations of the simplest AI application architecture?
Answer: It lacks context augmentation, guardrails, and optimizations for latency or cost.

Gradual Architectural Enhancements

As applications mature and needs arise, the simple architecture is augmented through a series of iterative steps:

  1. Enhance Context Input: Grant the model access to external data sources and tools for comprehensive information gathering.

  2. Implement Guardrails: Introduce protective measures for the system and users.

  3. Add Model Router & Gateway: Support complex pipelines, enhance security, and manage multiple models efficiently.

  4. Optimize with Caching: Reduce latency and costs using various caching strategies.

  5. Integrate Complex Logic & Write Actions: Maximize system capabilities by enabling advanced operational patterns and direct environmental modifications.

Note: The order of these steps can vary based on application-specific requirements. Monitoring, observability, and orchestration (chaining components) are integral and discussed later.


Step 1: Enhance Context – Giving Models Information

The first significant expansion involves enabling the AI system to construct relevant context for each user query. This is akin to providing the model with a robust "knowledge base" and "toolset" beyond its initial training data.

  • Purpose: To provide the model with the necessary information to produce high-quality, relevant outputs. Often considered "feature engineering" for foundation models.

  • Mechanisms:

    • Retrieval Mechanisms (RAG - Retrieval Augmented Generation):

      • Text Retrieval: Searching and retrieving relevant text documents.

      • Image Retrieval: Accessing and understanding visual information.

      • Tabular Data Retrieval: Querying structured data from tables.

    • Databases: Serve as repositories for various data types:

      • Documents (e.g., internal knowledge bases, articles)

      • Tables (e.g., SQL databases for transactional data)

      • Chat History (for conversational memory)

      • Vector Databases (for semantic search of embeddings)

    • Tool Use (Agentic Capabilities): Allows the model to automatically gather real-time information or perform actions via APIs:

      • Web search (e.g., current events, general knowledge)

      • News APIs, Weather APIs, Event APIs, etc.

      • Internal company APIs (e.g., CRM, inventory).

Provider Differences in Context Support

While many model API providers (OpenAI, Claude, Gemini) support context construction and tool use, their capabilities vary:

  • Document Upload Limits: Generic model APIs might have strict limits on the number or size of documents that can be uploaded for context, whereas specialized RAG solutions can accommodate vast amounts of data in a vector database.

  • Retrieval Algorithms: Differences exist in underlying retrieval algorithms and configurations (e.g., chunk sizes) impacting relevance.

  • Tool Support: Variations in supported tool types, execution modes (e.g., parallel function execution), and handling of long-running jobs.

Visual Representation (Figure 2):

Active Recall Question: How does "context construction" enhance an AI model's capabilities beyond its initial training data?
Answer: It provides the model with access to external, up-to-date, or proprietary information (via databases and read-only actions like web search or vector search) and allows it to process this information to create a more informed prompt, overcoming the limitations of its static training data.


Step 2: Put in Guardrails – Protecting the System

Guardrails are crucial protective layers designed to mitigate risks and safeguard both the AI system and its users. They are implemented at points of potential exposure, generally categorized into input guardrails and output guardrails.

Input Guardrails

Input guardrails protect against two primary risks:

  1. Leaking Private Information (PII) to External APIs: This is especially relevant when using third-party model APIs, as sensitive data might be inadvertently sent outside the organization.

    • Scenarios for Leakage:

      • An employee includes company secrets or user PII in a prompt.

      • A developer embeds internal policies/data into the application's system prompt.

      • A tool retrieves sensitive information from an internal database and adds it to the model's context.

    • Mitigation:

      • Sensitive Data Detection Tools: Use AI-powered tools to automatically identify and flag sensitive data.

        • Common Sensitive Data Classes: Personal Identifiable Information (ID numbers, phone numbers, bank accounts), Human Faces, Company Intellectual Property (specific keywords/phrases).
      • Action on Detection:

        • Block Query: Prevent the entire query from reaching the model.

        • Redaction/Masking: Remove or replace sensitive information with placeholders (e.g., [PHONE NUMBER], [ACCESS_TOKEN]).

      • Reversible PII Map (Figure 3): For masked information, a reverse dictionary maps the placeholder back to the original sensitive data for unmasking the model's response before it reaches the user, ensuring confidentiality while preserving functionality (a minimal code sketch follows this list).

        • Example (Figure 3):

          • User Query: pat = "secret_token_that_shouldn't_be_leaked"

          • Masked Query: pat = [ACCESS_TOKEN] (using a reversible PII map)

          • Model Response: Uses [ACCESS_TOKEN].

          • Unmasked Response: pat = "secret_token_that_shouldn't_be_leaked" (before sending to user).

  2. Executing Bad Prompts (Prompt Hacks/Attacks): Guardrails help defend against adversarial prompts (e.g., prompt injection, jailbreaking) that aim to compromise the system or extract unauthorized information. While risks can be mitigated, they can never be fully eliminated due to the probabilistic nature of models and human error.
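
A minimal sketch of the reversible PII map described above, assuming a simple regex-based detector (production systems typically use dedicated sensitive-data detection models; the patterns and placeholder names here are illustrative):

    import re

    # Illustrative patterns only; real detectors are far more robust.
    PATTERNS = {
        "PHONE_NUMBER": re.compile(r"\+?\d[\d\- ]{7,}\d"),
        "ACCESS_TOKEN": re.compile(r"\"secret_[A-Za-z0-9_']+\""),
    }

    def mask(query: str) -> tuple[str, dict[str, str]]:
        """Replace sensitive spans with placeholders and keep a reverse map."""
        pii_map = {}
        for label, pattern in PATTERNS.items():
            for i, match in enumerate(pattern.findall(query)):
                placeholder = f"[{label}_{i}]"
                pii_map[placeholder] = match
                query = query.replace(match, placeholder)
        return query, pii_map

    def unmask(response: str, pii_map: dict[str, str]) -> str:
        """Restore original values before the response reaches the user."""
        for placeholder, original in pii_map.items():
            response = response.replace(placeholder, original)
        return response

    masked_query, pii_map = mask('pat = "secret_token_that_should_not_leak"')
    # masked_query goes to the model API; the model answers using the placeholder.
    model_response = f"Store it safely: pat = {list(pii_map)[0]}"
    print(unmask(model_response, pii_map))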

Output Guardrails

Output guardrails primarily serve two functions:

  1. Catch Output Failures: Detect when the model's response does not meet defined standards.

  2. Specify Policy for Failures: Determine the appropriate action to take when a failure is detected.

  • Types of Failures:

    • Quality Failures:

      • Mal-formatted Responses: Not adhering to expected output formats (e.g., invalid JSON when JSON is expected).

      • Factually Inconsistent/Hallucinated Responses: Generating information that is incorrect or unsubstantiated.

      • Generally Bad Responses: Subjectively poor quality (e.g., a poorly written essay).

    • Security Failures:

      • Toxic Responses: Containing racist, sexual, or illegal content.

      • Private/Sensitive Information Leaks: Model inadvertently revealing confidential data.

      • Remote Tool/Code Execution Triggers: Generating outputs that could initiate harmful external actions.

      • Brand-Risk Responses: Misrepresenting the company or competitors.

  • False Refusal Rate: Crucial to track for security measures. Systems that are too secure can block legitimate requests, leading to user frustration.

  • Mitigation Strategies:

    • Simple Retry Logic: Since AI models are probabilistic, retrying a query can yield a different, correct response.

      • Sequential Retry: If a response fails, retry X times. Increases user-perceived latency.

      • Parallel Retry: Send the query multiple times concurrently, then select the best response. Increases API calls but reduces latency.

    • Human Fallback: Transferring complex or problematic queries to human operators.

      • Triggers: Specific phrases, sentiment (e.g., anger detected), or number of conversational turns (to prevent loops).
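
A minimal sketch of the sequential retry idea, using JSON validity as the output check; `call_model` is a placeholder for however the application invokes the model API:

    import json

    def passes_output_guardrails(response: str) -> bool:
        # Illustrative check: here the application expects valid JSON output.
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False

    def generate_with_retries(call_model, prompt: str, max_retries: int = 3):
        """Sequential retry: because models are probabilistic, re-querying may succeed."""
        for _ in range(max_retries):
            response = call_model(prompt)
            if passes_output_guardrails(response):
                return response
        return None  # fallback policy, e.g., transfer to a human operator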

Guardrail Implementation Considerations

  • Trade-offs:

    • Reliability vs. Latency: Guardrails add processing time. Some teams prioritize low latency over extensive guardrails.
  • Stream Completion Mode Challenges:

    • Hard to evaluate partial responses as tokens stream. Unsafe content might be shown before full detection.
  • Self-hosting vs. Third-Party APIs:

    • Third-Party APIs: Providers often include out-of-the-box guardrails, reducing the need for custom implementation.

    • Self-hosting: Eliminates external data transfer, reducing the need for certain input guardrails (e.g., PII leakage to external parties). However, you are fully responsible for all other guardrails.

  • Levels of Implementation:

    • Model Providers: Balance safety with model flexibility.

    • Application Developers: Implement additional guardrails on top of provider ones ("Defenses Against Prompt Attacks").

  • Off-the-shelf Solutions: Meta's Purple Llama, NVIDIA's NeMo Guardrails, Azure's PyRIT/AI content filters, Perspective API, OpenAI's content moderation API. Many model gateways also offer guardrail functionalities.

Visual Representation (Figure 4):

Note: Scoring is placed under Model API as it's often AI-powered (smaller models), but conceptually could also reside within Output Guardrails.

Active Recall Question: Explain the difference between input and output guardrails and provide an example of a risk each type protects against.
Answer: Input guardrails protect against issues before the query reaches the model (e.g., PII leakage, prompt injection). Output guardrails protect against issues after the model generates a response (e.g., hallucinated facts, toxic content, malformed output).


Step 3: Add Model Router and Gateway – Managing Complexity

As AI applications scale to use multiple models and require sophisticated management, Routers and Gateways become essential for handling complexity and optimizing costs.

Router

A Router intelligently directs queries to the most appropriate model or solution.

  • Purpose: To manage complexity and costs by using specialized models for different query types.

  • Benefits:

    • Specialized Models: Allows dedicated models to achieve higher performance for specific tasks (e.g., one model for technical troubleshooting, another for billing inquiries).

    • Cost Savings: Enables routing simpler queries to cheaper, smaller models, reserving expensive, powerful models for complex tasks.

  • Intent Classifier: The core component of a router is an intent classifier that predicts the user's goal or purpose from their query.

    • Examples of Routing based on Intent:

      • "Reset my password" → Route to FAQ page about password recovery.

      • "Correct a billing mistake" → Route to a human operator.

      • "Troubleshooting an issue with feature X" → Route to a chatbot specialized in technical support.

    • Out-of-Scope Detection: Prevents the system from engaging in irrelevant conversations, saving API calls (e.g., "Who will you vote for?" → "As a chatbot, I don't have the ability to vote...").

    • Ambiguity Detection: Helps identify unclear queries and prompts for clarification (e.g., "Freezing" → "Do you want to freeze your account or are you talking about the weather?").

  • Other Router Types:

    • Next-Action Predictor: For agents, decides the next best action (e.g., use code interpreter, search API).

    • Memory Hierarchy Router: Predicts which part of a memory system to retrieve information from (e.g., attached document vs. web search for "Melbourne animals").

  • Implementation: Often implemented using smaller, faster language models (e.g., GPT-2, BERT, Llama 7B) or even custom-trained classifiers. Speed and cost are critical for routers.

  • Context Limit Adjustments: Routers can help manage varying context limits of different models. If retrieved context exceeds a model's limit, the router might truncate it or route the query to a model with a larger context window.

  • Placement: Typically placed within the Model API box in the architecture diagrams, as they are often AI-powered models themselves, albeit smaller and faster. Routing often occurs before retrieval but can also happen afterward (e.g., routing to a human).
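
A minimal sketch of intent-based routing; the intent labels, destinations, and keyword heuristics are illustrative stand-ins for a small, fast classifier model:

    ROUTES = {
        "password_reset": "faq_password_recovery",
        "billing_dispute": "human_operator",
        "technical_issue": "troubleshooting_chatbot",
    }

    def classify_intent(query: str) -> str:
        """Stand-in for an intent classifier (often a small fine-tuned model)."""
        q = query.lower()
        if "password" in q:
            return "password_reset"
        if "billing" in q or "charge" in q:
            return "billing_dispute"
        if "vote" in q:
            return "out_of_scope"
        return "technical_issue"

    def route(query: str) -> str:
        intent = classify_intent(query)
        if intent == "out_of_scope":
            return "canned_refusal"  # avoids spending an expensive API call
        return ROUTES.get(intent, "general_chatbot")

    print(route("Reset my password"))  # -> faq_password_recovery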

Gateway (Model Gateway)

A Model Gateway acts as an intermediate layer providing a unified and secure interface to various AI models.

  • Purpose: To abstract away the complexities of interacting with multiple model APIs, whether self-hosted or commercial.

  • Basic Functionality: Unified Interface:

    • Provides a single API endpoint to interact with different models (e.g., OpenAI, Google, custom self-hosted).

    • Simplifies code maintenance: if a model API changes, only the gateway needs updating, not every application consuming it.

  • Code Example (Conceptual):

      import os

      import google.generativeai as genai
      import openai
      from flask import Flask, jsonify, request

      app = Flask(__name__)

      def openai_model(input_data, model_name, max_tokens):
          # Legacy OpenAI completions API, kept for brevity in this conceptual sketch.
          openai.api_key = os.environ["OPENAI_API_KEY"]
          response = openai.Completion.create(
              engine=model_name,
              prompt=input_data,
              max_tokens=max_tokens
          )
          return {"response": response.choices[0].text.strip()}

      def gemini_model(input_data, model_name, max_tokens):
          genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
          model = genai.GenerativeModel(model_name=model_name)
          response = model.generate_content(
              input_data,
              generation_config=genai.types.GenerationConfig(max_output_tokens=max_tokens)
          )
          return {"response": response.text}

      @app.route('/model', methods=['POST'])
      def model_gateway():
          # Unified endpoint: one API for multiple underlying model providers.
          data = request.get_json()
          model_type = data.get("model_type")
          model_name = data.get("model_name")
          input_data = data.get("input_data")
          max_tokens = data.get("max_tokens")

          if model_type == "openai":
              result = openai_model(input_data, model_name, max_tokens)
          elif model_type == "gemini":
              result = gemini_model(input_data, model_name, max_tokens)
          else:
              return jsonify({"error": f"unsupported model_type: {model_type}"}), 400
          return jsonify(result)
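
    A hypothetical client call against the gateway sketched above (the endpoint, field names, and model name follow this sketch, not any real service):

      import requests

      resp = requests.post(
          "http://localhost:5000/model",
          json={
              "model_type": "openai",
              "model_name": "gpt-3.5-turbo-instruct",
              "input_data": "Summarize our refund policy in one sentence.",
              "max_tokens": 100,
          },
      )
      print(resp.json()["response"])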
    
  • Access Control and Cost Management:

    • Centralized Access: Instead of distributing organizational API keys, users/applications access the gateway, providing a controlled point of entry.

    • Fine-Grained Control: Specify which users/applications can access which models.

    • Usage Monitoring & Limits: Track and limit API calls to prevent abuse and manage costs.

  • Fallback Policies:

    • Manages rate limits or API failures by routing requests to alternative models, retrying gracefully, or handling errors. Ensures application continuity.
  • Other Functionalities:

    • Load balancing, logging, analytics.

    • Some gateways also incorporate caching and guardrails.

  • Off-the-shelf Gateways: Portkey’s AI Gateway, MLflow AI Gateway, Wealthsimple’s LLM Gateway, TrueFoundry, Kong, Cloudflare.

Visual Representation (Figure 5 - Router):

Visual Representation (Figure 6 - Gateway Concept):

Visual Representation (Figure 7 - Full Architecture with Gateway):

NOTE: A similar abstraction layer, a tool gateway, could also be useful for accessing various tools, though it's less common currently.

Active Recall Question: What is the primary role of a Model Gateway in a complex AI application, and how does it benefit security and cost management?
Answer: A Model Gateway provides a unified, secure interface to various AI models (self-hosted or commercial). It benefits security by centralizing access control and preventing the direct distribution of API keys. For cost management, it enables monitoring and limiting API usage, preventing abuse.


Step 4: Reduce Latency with Caches – Speeding Up Responses

Caching is a fundamental optimization technique in software engineering, and its principles are equally vital for AI applications to reduce latency and costs. This section focuses on system caching, distinct from inference caching techniques (like KV caching or prompt caching) discussed in Inference Optimization, which are typically handled by model API providers.

Exact Caching

  • Mechanism: Stores and retrieves items only when an exact match to a previously processed request is made.

  • Examples:

    • Summaries: If a user requests a summary of a specific product, the system checks if that exact summary is cached.

    • Embedding-Based Retrieval: If an incoming query for vector search has already been embedded and searched, the cached vector search result is returned without re-computation.

  • Benefits: Highly effective for queries involving multiple steps (e.g., chain-of-thought processing) or time-consuming actions (e.g., external retrieval, SQL execution, web searches).

  • Implementation:

    • Storage: Can use in-memory storage (for speed) or databases like PostgreSQL, Redis, or tiered storage (balancing speed and capacity).

    • Eviction Policies: Crucial for managing cache size and performance. Common policies include:

      • LRU (Least Recently Used): Discards the least recently used items first.

      • LFU (Least Frequently Used): Discards the least frequently used items first.

      • FIFO (First In, First Out): Discards the oldest items first.

  • Cache Longevity:

    • User-specific queries ("What's the status of my recent order?") are less likely to be reused by others and shouldn't be cached widely.

    • Time-sensitive queries ("How's the weather?") also have limited cache utility.

    • Many teams train classifiers to predict whether a query should be cached.

  • ⚠️ WARNING: Data Leaks with Caching:

    • Improper handling of cached data can lead to serious privacy breaches.

    • Scenario: User X asks "What is the return policy for electronics products?" If the policy is user-membership dependent, the system might retrieve X's private information and generate a personalized response. If this response is mistakenly cached as a "generic" answer, then User Y asking the same generic question later could receive the cached response containing X's private data. This highlights the need for careful design and categorization of cacheable content.
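
A minimal exact-cache sketch that keys on a hash of the full prompt and evicts with an LRU policy; `call_model` is a placeholder, and a production cache would also classify which queries are safe and worthwhile to cache, per the warning above:

    import hashlib
    from collections import OrderedDict

    class ExactCache:
        """Exact caching keyed on a hash of the full prompt, with LRU eviction."""

        def __init__(self, max_items: int = 10_000):
            self.store: OrderedDict[str, str] = OrderedDict()
            self.max_items = max_items

        def _key(self, prompt: str) -> str:
            return hashlib.sha256(prompt.encode()).hexdigest()

        def get(self, prompt: str):
            key = self._key(prompt)
            if key in self.store:
                self.store.move_to_end(key)  # mark as most recently used
                return self.store[key]
            return None

        def put(self, prompt: str, response: str):
            self.store[self._key(prompt)] = response
            if len(self.store) > self.max_items:
                self.store.popitem(last=False)  # evict the least recently used item

    def cached_generate(cache: ExactCache, call_model, prompt: str) -> str:
        cached = cache.get(prompt)
        if cached is not None:
            return cached
        response = call_model(prompt)
        cache.put(prompt, response)
        return response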

Semantic Caching

  • Mechanism: Allows reuse of cached items even if the incoming query is only semantically similar (not identical) to a cached query.

  • Example: User 1 asks "What's the capital of Vietnam?" (Answer: "Hanoi"). User 2 asks "What's the capital city of Vietnam?". With semantic caching, the system can reuse the cached "Hanoi" answer.

  • Benefits: Aims to increase cache hit rates and potentially reduce costs by leveraging similar queries.

  • How it Works (Semantic Similarity Recap):

    1. Embedding Generation: For each incoming query, an embedding is generated using an embedding model.

    2. Vector Search: The embedding of the incoming query is used in a vector database to find the cached embedding with the highest similarity score.

    3. Similarity Threshold: If the highest similarity score (X) exceeds a predefined threshold, the cached result is returned. Otherwise, the new query is processed and its embedding/result cached.

  • Requires a vector database to store cached query embeddings.
  • Challenges/Dubious Value:

    • Prone to Failure: Relies heavily on high-quality embeddings, accurate vector search, and a reliable similarity metric.

    • Performance Reduction: Incorrect similarity detection can lead to wrong answers being served from the cache, undermining model performance.

    • Tricky Thresholding: Setting the right similarity threshold requires significant trial and error.

    • Compute-Intensive: Vector search itself can be time-consuming and compute-intensive, especially with large cached embedding sets.

  • When it's Worthwhile: Semantic caching might be beneficial if the cache hit rate is consistently high, significantly offsetting the associated efficiency, cost, and performance risks. Thorough evaluation is critical before implementation.
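
A minimal semantic-cache sketch using cosine similarity over query embeddings; `embed` stands in for an embedding model, the threshold needs per-application tuning, and a vector database would replace the linear scan at scale:

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    class SemanticCache:
        def __init__(self, embed, threshold: float = 0.92):
            self.embed = embed          # placeholder embedding model
            self.threshold = threshold  # tuning this typically takes trial and error
            self.entries: list[tuple[np.ndarray, str]] = []

        def get(self, query: str):
            q_emb = self.embed(query)
            best_score, best_response = 0.0, None
            for emb, response in self.entries:  # a vector DB replaces this at scale
                score = cosine_similarity(q_emb, emb)
                if score > best_score:
                    best_score, best_response = score, response
            return best_response if best_score >= self.threshold else None

        def put(self, query: str, response: str):
            self.entries.append((self.embed(query), response))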

Visual Representation (Figure 8 - with Caches):

  • Active Recall Question: What's the fundamental difference between exact caching and semantic caching, and what is a significant risk unique to caching in LLM applications?

  • Answer: Exact caching reuses results only for identical queries, while semantic caching reuses results for semantically similar queries. A significant risk unique to LLM caching is data leakage: a cached personalized response for one user might be inadvertently served to another user if the query is deemed generic but contained sensitive, user-specific information.


Step 5: Add Agent Patterns – Enabling Complex System Behavior

While the architectures discussed so far mostly handle sequential flows, real-world applications often demand more complex interactions, involving loops, parallel execution, and conditional logic. This is where Agent Patterns become critical.

  • Purpose: To enable the AI system to perform complex, multi-step tasks by allowing its outputs to influence subsequent actions and iterations.

  • Feedback Loops (Figure 9):

    • An agentic system can evaluate its own generated output.

    • If the task isn't fully accomplished, it can decide to perform another retrieval or action to gather more information.

    • The original response, combined with newly retrieved context, is then fed back into the same or a different model for further processing.

    • This creates a dynamic loop, allowing for iterative refinement and problem-solving.

Visual Representation (Figure 9 - Feedback Loop): Orange arrow highlights the feedback loop.
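
A minimal sketch of the evaluate-and-retry feedback loop shown in Figure 9, with `generate`, `evaluate`, and `retrieve_more_context` as placeholder functions:

    def agentic_answer(query: str, generate, evaluate, retrieve_more_context,
                       max_iterations: int = 3) -> str:
        """Generate, self-evaluate, optionally gather more context, then retry."""
        context = ""
        response = generate(query, context)
        for _ in range(max_iterations):
            if evaluate(query, response):  # has the task been accomplished?
                return response
            # Feed the previous response plus newly retrieved context back in.
            context += retrieve_more_context(query, response)
            response = generate(query, context)
        return response  # best effort once the loop budget is exhausted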

  • Write Actions:

    • Beyond generating text or retrieving information, models can be empowered to initiate write actions, which directly modify the external environment.

    • Examples: Composing and sending an email, placing an order in an e-commerce system, initializing a bank transfer, updating a database entry.

    • Impact: Write actions vastly expand a system's capabilities, allowing it to move beyond informational responses to direct operational execution.

    • Risks: They also introduce significantly higher risks. Giving a model write access requires the utmost care and robust safety mechanisms, as errors or malicious prompts could lead to irreversible and damaging consequences.

Visual Representation (Figure 10 - with Write Actions):

Complexity Implication: As the architecture grows (especially with agent patterns), it introduces more failure modes, making debugging significantly harder. This underscores the critical need for robust Monitoring and Observability (discussed next).

  • Active Recall Question: What is the primary difference between "read-only actions" and "write actions" for an AI system, and what crucial consideration comes with implementing write actions?

  • Answer: Read-only actions allow the AI system to retrieve information without modifying the environment (e.g., search web, query database). Write actions allow the AI system to change the environment (e.g., send emails, update orders). The crucial consideration with write actions is the significantly increased risk; they must be implemented with utmost care due to potential for irreversible errors or malicious exploitation.


Monitoring and Observability – Ensuring System Health

Monitoring and observability are not afterthoughts but should be integral to the design of any complex software product, especially AI applications.

Goal of Monitoring

The fundamental goal is two-fold:

  1. Mitigate Risks: Identify and address issues like application failures, security attacks, and data/model drifts.

  2. Discover Opportunities: Pinpoint areas for application improvement and cost savings.

  • Accountability: Provides visibility into system performance.

Key Observability Metrics (from DevOps)

These metrics help evaluate the quality of a system's observability:

  • MTTD (Mean Time To Detection): How long it takes to identify that something has gone wrong.

  • MTTR (Mean Time To Response): How long it takes to resolve an issue after it's detected.

  • CFR (Change Failure Rate): The percentage of changes or deployments that result in failures requiring fixes or rollbacks. A high CFR doesn’t necessarily mean a bad monitoring system, but indicates a need to redesign the evaluation pipeline to catch bad changes pre-deployment.

Relationship with Evaluation: Evaluation metrics should translate well to monitoring metrics. Issues found during monitoring should inform and improve the evaluation pipeline.

Monitoring vs. Observability (Key Distinction)

  • Monitoring: Makes no assumptions about a system's internal state. It tracks external outputs to infer if something is wrong internally, without a guarantee of pinpointing the exact cause.

  • Observability: A stronger concept, assuming a system's internal states can be inferred from its external outputs. It's about instrumenting the system to collect sufficient information (logs, metrics) at runtime, enabling diagnosis of what went wrong without needing new code deployments.

Terminology Used: "Monitoring" refers to the act of tracking information, while "Observability" encompasses the entire process of instrumentation, tracking, and debugging.

Metrics in AI Applications

Metrics are condensed numerical representations of events over time, signaling when something is wrong or highlighting improvement opportunities. Their design should align directly with the failure modes you want to catch.

  • Application-Specific Design: Which metrics to track is highly dependent on the specific application's goals and potential failure points. Requires analytical thinking, statistical knowledge, and creativity.

  • Recap of Model Quality Metrics:

    • Format Failures:

      • Frequency of invalid JSON outputs.

      • Proportion of easily fixable mal-formatted outputs vs. harder ones.

    • Open-Ended Generations:

      • Factual consistency (e.g., can output be inferred from context?).

      • Relevance and quality (conciseness, creativity, positivity), often via AI judges.

    • Safety Metrics:

      • Toxicity-related metrics.

      • Detection of private/sensitive information in inputs/outputs.

      • Guardrail trigger rate.

      • False refusal rate (important for usability).

      • Detection of abnormal queries (potential prompt attacks, edge cases).

    • User Natural Language Feedback & Conversational Signals:

      • User stop generation rate.

      • Average turns per conversation.

      • Average tokens per input (shifts can signal changes in how users prompt, such as more complex tasks or more concise phrasing).

      • Average tokens per output (signals model verbosity and which query types produce lengthy answers).

      • Model's output token distribution (diversity changes over time).

    • Component-Specific Metrics:

      • RAG Applications: Context relevance, context precision.

      • Vector Databases: Storage requirements, query latency (how long it takes to query the data).

    • Length-Related Metrics:

      • Longer contexts and responses increase latency and incur higher costs.
  • Correlation with Business North Star Metrics:

    • A business "North Star" is a single, overarching metric that guides a company's strategic direction - a guiding principle that, like the actual North Star, helps the company stay on course and achieve long-term success.

    • Track correlation with key business indicators such as DAU (daily active users), session duration (how long users engage with the app), and subscriptions. Strong correlations with your north star offer insights for improvement; weak correlations indicate metrics that might not be worth optimizing.

  • Latency Metrics Recap:

    • Time to First Token (TTFT): Time until the first token is generated.

    • Time Per Output Token (TPOT): Time (average) to generate each subsequent output token.

    • Total Latency: Overall time to complete a response.

    • Tools and Techniques:

      • Benchmarking tools: Several tools can automate the measurement of TTFT and TPOT, such as vLLM, GenAI-Perf, and LLMPerf.

      • Monitoring tools: Tools like New Relic can be used to track these metrics in real-time within a production environment.

      • Custom scripts: You can also write your own scripts to measure these metrics by recording timestamps and performing calculations based on your LLM's streamed output (a minimal sketch follows this metrics list).

    • Track these per user to assess scalability with more users.

  • Cost Metrics:

    • Number of queries.

    • Volume of input and output tokens (e.g., tokens per second - TPS).

    • Requests per second (RPS) for API rate limit management.

  • Metric Calculation:

    • Spot Checks: Sampling a subset of data for quick issue identification.

    • Exhaustive Checks: Evaluating every request for comprehensive performance view. (Often a combination is used).

  • Granularity: Break down metrics by:

    • Users

    • Releases/Deployments

    • Prompt/Chain Versions

    • Prompt/Chain Types

    • Time

    • This helps pinpoint performance variations and specific issues.
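
A minimal custom script for the latency metrics above: record timestamps while streaming to compute TTFT, TPOT, and total latency. `stream_tokens` is a placeholder for whatever streaming generator your model API exposes:

    import time

    def measure_latency(stream_tokens, prompt: str) -> dict:
        """Compute TTFT, TPOT, and total latency from a streamed response."""
        start = time.perf_counter()
        first_token_time = None
        n_tokens = 0
        for _ in stream_tokens(prompt):  # placeholder streaming generator
            now = time.perf_counter()
            if first_token_time is None:
                first_token_time = now
            n_tokens += 1
        end = time.perf_counter()
        if first_token_time is None:  # no tokens were generated
            return {"ttft_s": None, "tpot_s": None, "total_s": end - start, "tokens": 0}
        ttft = first_token_time - start
        tpot = (end - first_token_time) / max(n_tokens - 1, 1)
        return {"ttft_s": ttft, "tpot_s": tpot, "total_s": end - start, "tokens": n_tokens}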

Logs and Traces

  • Logs:

    • Nature: Append-only record of events; provide detailed historical context. Answer "Has this happened before?".

    • Debugging Process: Metrics alert to a problem (e.g., spike in activity); logs help pinpoint what happened at that time; correlation confirms the issue.

    • Accessibility: Logs must be readily available and accessible for fast response (no 15-minute delays).

    • Logging Strategy: "Log everything."

      • Configurations: Model API endpoint, model name, sampling settings (temperature, top-p, top-k, stop conditions).

      • Prompts: User query, final prompt sent to model, intermediate outputs.

      • Outputs: Model response.

      • Tooling: Tool calls, tool outputs.

      • System Events: Component start/end, crashes.

      • Metadata: Tags and IDs to trace origin within the system.

    • Management: Large log volumes necessitate automated analysis and AI-powered anomaly detection tools.

    • Manual Inspection: Daily manual review of production data is vital for developers to understand user behavior, refine prompts, and update evaluation pipelines (Shankar et al., 2024).

  • Traces:

    • Nature: Reconstructed, linked events forming a complete timeline of a transaction. Show the path of a request through various components.

    • In AI Applications: Reveal the entire flow from user query to final response, including:

      • System actions taken.

      • Documents retrieved.

      • Final prompt constructed.

      • Time and cost associated with each step.

    • Benefit: Allows step-by-step tracing of query transformation. If a query fails, enables precise identification of the faulty component (incorrect processing, irrelevant context, bad generation).

    • Example: LangSmith's visualization of a request trace (Figure 11).

Visual Representation (Figure 11 - LangSmith Trace):
(This figure is an example of a visual output from a tool and is not a system architecture diagram. It shows a hierarchical breakdown of an agent's execution, including LLM calls, tool usage, total tokens, and time taken for each step.)

Drift Detection

Drift refers to unexpected changes in an AI system's behavior or environment, which become more likely with system complexity.

  • Types of Drifts:

    • System Prompt Changes: Updates to underlying prompt templates, or manual fixes by coworkers, can alter system behavior without your explicit awareness. Simple logic can detect these changes (see the sketch after this list).

    • User Behavior Changes: Users adapt to technology, learning to phrase queries differently for better results (e.g., more concise prompts). This can cause gradual metric shifts (e.g., shorter response lengths) that require investigation to understand root causes.

    • Underlying Model Changes: When using third-party APIs, model providers may update their models without disclosure. These "silent updates" can significantly impact performance, as observed in studies (e.g., Chen et al., 2023, on GPT-4/3.5 version changes; Voiceflow on GPT-3.5-turbo changes). Detecting these requires continuous monitoring.
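
A minimal sketch of the "simple logic" for detecting system prompt changes: fingerprint the deployed prompt template and alert when it no longer matches the last reviewed version (the hash value and alerting hook are illustrative):

    import hashlib

    APPROVED_PROMPT_HASH = "3f6c..."  # recorded when the prompt was last reviewed

    def prompt_fingerprint(template: str) -> str:
        return hashlib.sha256(template.encode()).hexdigest()

    def check_prompt_drift(current_template: str) -> bool:
        """Return True (and alert) if the deployed system prompt has changed."""
        drifted = prompt_fingerprint(current_template) != APPROVED_PROMPT_HASH
        if drifted:
            print("ALERT: system prompt changed without review")  # monitoring hook
        return drifted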

Active Recall Question: Distinguish between logs and traces in the context of AI application observability. Why is drift detection crucial for AI applications?
Answer: Logs are sequential records of events, useful for understanding what happened at a specific time. Traces link these events to reconstruct the complete path of a single request through the system, pinpointing where failures occurred. Drift detection is crucial for AI applications because changes in system prompts, user behavior, or underlying model versions (especially with third-party APIs) can subtly or significantly alter performance and reliability without explicit notification, necessitating continuous monitoring.


AI Pipeline Orchestration – Structuring Complex Flows

An AI Pipeline Orchestrator is a tool that helps define and manage how various components (models, databases, tools) work together to form an end-to-end AI application. It ensures seamless data flow and coordination.

Two Core Steps

  1. Components Definition:

    • Inform the orchestrator about all system components: different models (potentially via a model gateway), external data sources for retrieval, and available tools (including those for evaluation and monitoring).
  2. Chaining (Pipelining):

    • Defines the sequential flow of operations, akin to function composition.

    • Example Pipeline Steps:

      1. Process the raw user query.

      2. Retrieve relevant data based on the processed query.

      3. Combine the original query and retrieved data into a model-ready prompt.

      4. The model generates a response.

      5. Evaluate the response.

      6. If the response is good, return it; otherwise, route the query to a human operator.

    • Responsibilities: The orchestrator manages data passing between components and provides tools to ensure output formats from one step match input formats for the next. It should also notify on disruptions (e.g., component failures, data mismatches).
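
A minimal sketch of the chaining step as plain function composition, with each stage of the example pipeline above passed in as a placeholder function:

    def run_pipeline(raw_query: str, process, retrieve, build_prompt,
                     generate, evaluate, route_to_human):
        """Chain: process -> retrieve -> build prompt -> generate -> evaluate."""
        query = process(raw_query)
        context = retrieve(query)
        prompt = build_prompt(query, context)
        response = generate(prompt)
        if evaluate(query, response):
            return response
        return route_to_human(raw_query, response)  # fallback when evaluation fails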

⚠️ WARNING: An AI pipeline orchestrator is distinct from general workflow orchestrators like Airflow or Metaflow, which are designed for broader data processing workflows, not specifically AI model pipelines.

Design Considerations

  • Parallelism for Latency: For applications with strict latency requirements, design pipelines to perform as many steps in parallel as possible (e.g., routing (deciding where to send a query) and PII removal can often happen concurrently).

  • Orchestration Tools: Common options include LangChain, LlamaIndex, and Haystack.

Considerations Before Adoption

While tempting to use from the start, it's often advisable to build your application without an orchestrator first. External tools add complexity and can abstract away critical details, making debugging harder. Adopt one when your application genuinely needs it.

When evaluating orchestrators, consider:

  1. Integration and Extensibility:

    • Does it support your current and future components (models, databases, frameworks)?

    • How easy is it to extend if a specific component isn't natively supported?

  2. Support for Complex Pipelines:

    • Can it handle advanced features like branching, parallel processing, and robust error handling as your application grows?
  3. Ease of Use, Performance, and Scalability:

    • User-Friendliness: Intuitive APIs, comprehensive documentation, strong community support.

    • Performance: Avoid orchestrators that introduce hidden API calls or significant latency.

    • Scalability: Ensure it can scale effectively with increasing applications, developers, and traffic.

Active Recall Question: What are the two main steps involved in AI pipeline orchestration, and what key aspects should be evaluated before adopting an orchestration tool?
Answer: The two main steps are Components Definition (declaring models, data sources, tools) and Chaining (Pipelining) (defining the sequential flow of operations). Before adoption, evaluate its integration and extensibility with existing/future components, its support for complex pipelines (branching, parallelism, error handling), and its ease of use, performance, and scalability. It's often advisable to start without one to avoid premature complexity.


User Feedback: The Data Flywheel and Product Development

User feedback is paramount in AI applications, serving two crucial roles beyond traditional software: evaluating performance and informing development. For AI, it also acts as a source of proprietary data, which is a significant competitive advantage, fueling the data flywheel (as discussed in Dataset Engineering).

  • Importance of User Feedback in AI:

    • Model Personalization: Tailoring models for individual users based on their interactions.

    • Future Model Training: Proprietary user data is invaluable for training and fine-tuning subsequent model iterations, especially as data becomes scarcer.

    • Competitive Advantage: Early product launches that attract users can collect data to continuously improve models, creating a difficult-to-match lead for competitors.

  • Ethical Considerations: User feedback is user data and must be handled with care. User privacy must be respected, and users have the right to know how their data is being used.

Extracting Conversational Feedback: New Opportunities and Challenges

Traditionally, feedback is categorized as either explicit or implicit. AI applications, particularly those with conversational interfaces, introduce new nuances to these categories.

  • Explicit Feedback:

    • Definition: Information users consciously provide in response to direct requests.

    • Examples: Thumbs up/down, upvote/downvote, star ratings, "Did we solve your problem?" (yes/no).

    • Characteristics: Fairly standard across applications, well-understood.

  • Implicit Feedback:

    • Definition: Information inferred from user actions, rather than direct statements.

    • Traditional Examples: Purchasing a recommended product (implies good recommendation).

    • AI-Specific (Novel) Implicit Feedback: Foundation models enable new application genres, leading to novel forms of implicit feedback.

  • The Conversational Advantage:

    • Conversational interfaces make it easier for users to provide feedback naturally, similar to daily dialogues.

    • The language users employ to direct AI can implicitly convey feedback on both performance and preference.

    • Example (Trip Planning Assistant):

      • User asks for hotel recommendations in Sydney for 3 nights.

      • Assistant provides three options.

      • User Response & Inferred Feedback:

        • "Yes book me the one close to galleries" → Indicates interest in art.

        • "Is there nothing under $200?" → Reveals price-conscious preference, suggests assistant missed user's implicit need.

  • Utilization of Extracted Feedback:

    • Evaluation: Deriving metrics to monitor application performance.

    • Development: Training future models or guiding their design.

    • Personalization: Tailoring the application experience for individual users.

  • Challenges:

    • Feedback is blended into daily conversations, making extraction difficult.

    • Requires rigorous data analysis and user studies to understand its true meaning, beyond initial intuition.

  • Historical Context: Conversational feedback has been an active research area since the late 2010s, with efforts in reinforcement learning (e.g., Fu et al., Goyal et al., Zhou and Small, Sumers et al.) and early conversational AI applications (e.g., Amazon Alexa - Ponnusamy et al.; Park et al., Spotify Voice - Xiao et al., Yahoo! Voice - Hashimoto and Sassano).

Natural Language Feedback (Content-Based)

Feedback derived directly from the content of user messages. Tracking these signals in production helps monitor application performance.

  • 1. Early Termination:

    • Signal: User stopping generation midway, exiting the app, telling the model to stop, or not responding to the agent.

    • Implication: Strong indicator that the conversation is not going well.

  • 2. Error Correction:

    • Signal: User follow-ups starting with "No,..." or "I meant,...", rephrasing requests, or directly pointing out model errors.

    • Implication: Model's response was likely inaccurate or misunderstood the intent.

    • Example (Figure 12): User's initial query "weather in mission bay today" resulted in a San Francisco prediction. User then rephrases with "mission bay san diego" after the first response, signaling a misunderstanding.

    • Specific Corrections: Users might correct factual errors ("Bill is the suspect, not the victim"), or nudge agents towards better actions ("You should also check XYZ GitHub page").

    • Confirmation Requests: "Are you sure?", "Check again", "Show me the sources" suggest a lack of detail or user distrust, even if the answer is correct.

    • Direct Edits: Users directly editing model-generated content (e.g., code).

      • Strong Signal: The original content was incorrect.

      • Preference Data Source: The original (edited) response becomes the "losing response," and the edited version becomes the "winning response" for preference fine-tuning.

  • 3. Complaints:

    • Signal: Users express dissatisfaction without attempting to correct, stating the answer is wrong, irrelevant, toxic, too long, lacking detail, or simply "bad."

    • FITS Dataset Analysis (Xu et al., 2022; Yuan et al., 2023):

      1. Clarify demand again (26.54%): User reiterating their request.

      2. Complain about a non-answer, irrelevant info, or being told to find out themselves (16.20%): Model failed to address the core question.

      3. Point out specific search results (16.17%): User directs model to better information.

      4. Suggest bot use search results (15.27%): User prompts model to perform a search.

      5. State factual incorrectness/ungrounded (11.27%): Model hallucinated or wasn't supported by context.

      6. Point out lack of specificity/accuracy/completeness/detail (9.39%): Response quality issue.

      7. Complain about low confidence (4.17%): Model frequently uses phrases like "I am not sure" or “I don’t know”.

      8. Complain about repetition/rudeness (0.99%): Usability/politeness issue.

    • Actionable Insights: Understanding how the bot fails helps direct improvements (e.g., if answers are too verbose, change prompt for conciseness; if lacking detail, prompt for specificity).

  • 4. Sentiment:

    • Signal: General expressions of emotion (frustration, disappointment, ridicule) like "Uggh."

    • Insight: Analyzing sentiment throughout a conversation can reveal how the bot is performing (e.g., user starting angry, ending happy, suggests problem resolution). Call centers track voice sentiment for similar reasons.

  • Model's Refusal Rate:

    • Signal: Model responses like "Sorry, I don't know that one" or "As a language model, I can't do..."

    • Implication: User is likely unhappy due to task non-completion.
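
A minimal sketch of extracting some of these natural language signals from a single user turn with keyword heuristics; real systems usually combine such rules with a small classifier, and the phrase lists here are illustrative:

    import re

    ERROR_CORRECTION = re.compile(r"^(no[,.]|i meant|that's wrong|you're wrong)", re.I)
    CONFIRMATION_REQUEST = re.compile(r"(are you sure|check again|show me the sources)", re.I)
    COMPLAINT = re.compile(r"(too long|not relevant|doesn't answer|bad answer)", re.I)

    def extract_signals(user_turn: str) -> dict[str, bool]:
        """Flag error corrections, confirmation requests, and complaints in one turn."""
        return {
            "error_correction": bool(ERROR_CORRECTION.search(user_turn)),
            "confirmation_request": bool(CONFIRMATION_REQUEST.search(user_turn)),
            "complaint": bool(COMPLAINT.search(user_turn)),
        }

    print(extract_signals("No, I meant Mission Bay in San Diego"))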

Other Conversational Feedback (Action-Based)

Feedback inferred from user actions rather than explicit messages.

  • 1. Regeneration:

    • Signal: User chooses to generate another response (sometimes with a different model).

    • Implication: Could indicate dissatisfaction with the first response, or a desire for options/comparison (common in creative tasks like image/story generation).

    • Cost Impact: Stronger signal if usage-based billing makes regeneration incur extra cost, implying true dissatisfaction rather than idle curiosity.

    • Consistency Check: Users might regenerate for complex requests to verify consistency between multiple model outputs. If two responses give contradicting answers, we can’t trust either.

    • Comparative Feedback (Figure 13): After regeneration, applications might explicitly ask users to compare the new response with the previous one ("Better," "Worse," "Same"). This data is valuable for preference fine-tuning.

  • 2. Conversation Organization:

    • Signal: User actions like deleting, renaming, sharing, or bookmarking conversations.

    • Implications:

      • Deleting: Strong signal of a bad conversation (unless for privacy/embarrassment).

      • Renaming: Conversation is good, but the auto-generated title was poor.

      • Sharing/Favoriting: Generally positive signals, indicating value.

  • 3. Conversation Length:

    • Signal: Number of turns per conversation.

    • Interpretation (Application-Dependent):

      • AI Companions: Long conversations might indicate user enjoyment.

      • Productivity/Customer Support Chatbots: Long conversations might signal inefficiency in resolving issues.

  • 4. Dialogue Diversity:

    • Signal: Measured by distinct token or topic count, interpreted alongside conversation length.

    • Implication: If a conversation is long but the bot is repetitive, the user might be stuck in a loop.

Challenges with Feedback Types

  • Explicit Feedback:

    • Pros: Easier to interpret.

    • Cons: Sparse (users often unwilling to make extra effort), susceptible to response biases (e.g., unhappy users more likely to complain).

  • Implicit Feedback:

    • Pros: More abundant (limited only by imagination).

    • Cons: Noisier and harder to interpret (e.g., sharing a conversation can be a positive or negative signal depending on the user's intent: one user might share conversations in which the model made glaring mistakes, while another mostly shares useful conversations with coworkers). It's important to study your users to understand why they take each action.

  • Mitigation: Combining multiple signals (e.g., user rephrases after sharing a link) can clarify intent. Research in extracting and interpreting implicit conversational feedback is ongoing.

Active Recall Question: Provide two examples of "natural language feedback" and two examples of "action-based conversational feedback." What is a key trade-off between explicit and implicit feedback?
Answer:

  • Natural Language Feedback Examples: "No, I meant..." (error correction), "Too cliche" (complaint).

  • Action-Based Conversational Feedback Examples: User clicking "Regenerate" (regeneration), User deleting a conversation (conversation organization).

  • Trade-off: Explicit feedback is easier to interpret but often sparse and subject to response biases. Implicit feedback is abundant but noisier and harder to interpret accurately.


Feedback Design: When and How to Collect

Effective feedback design seamlessly integrates into the user workflow, is non-intrusive, and provides incentives for thoughtful input.

When to Collect Feedback

Feedback should be available throughout the user journey, especially when specific needs arise, while remaining non-intrusive.

  • 1. In the Beginning (Calibration):

    • Purpose: Calibrate the application to user preferences/needs.

    • Examples: Face ID apps scanning faces, voice assistants asking for voice samples, language learning apps gauging skill level.

    • Consideration:

      • Necessary Calibration: (e.g., Face ID) is unavoidable.

      • Optional Feedback: For other apps, initial feedback should be optional to reduce friction; default to a neutral option and calibrate over time.

  • 2. When Something Bad Happens (Error Reporting/Recovery):

    • Purpose: Allow users to report failures and recover.

    • Examples: Downvoting a response, regenerating with the same/different model.

    • Conversational Feedback: "You're wrong," "Too cliche," "I want something shorter."

    • Goal: Users should ideally still accomplish tasks despite model mistakes.

      • Human-AI Collaboration: Users can edit AI outputs (e.g., correcting a product category). If AI fails, transfer to human agents (common in customer support).

      • Inpainting (Figure 14): In image generation (like DALL-E), users can select a region and prompt for specific edits. This provides high-quality feedback while allowing users to refine outputs.

  • 3. When the Model Has Low Confidence (Comparative Feedback):

    • Purpose: Increase model confidence by asking users for clarification or preference when unsure about an action.

    • Example (Summarization): If unsure between a short or detailed summary, the model can output both side-by-side (if latency isn't impacted).

    • Comparative Evaluation (Figure 15): ChatGPT asks "Which response do you prefer?" for side-by-side options, generating preference fine-tuning data.

    • Partial Responses (Figure 16): Google Gemini shows partial responses; clicking to expand acts as an implicit signal of preference. It’s unclear, however, whether showing full or partial responses side by side gives more reliable feedback.

    • Uncertainty in Tagging (Figure 17): A photo organization application automatically tags photos so it can answer queries like "Show me all the photos of X." When unsure whether two images contain the same individual, Google Photos asks "Same or different person?"

  • Positive Feedback:

    • Actions: Thumbs up, favoriting, sharing.

    • Debate: Apple's guideline warns against asking for both positive and negative feedback, as it might imply good results are exceptions.

    • Counter-argument: Some product managers seek positive feedback to identify "loved" features, allowing teams to focus resources effectively.

    • Mitigation: Limit frequency of positive feedback requests to avoid annoying users (e.g., if you have a large user base, show to only 1% of users). Be mindful of sample size and potential biases. But with a large enough pool, the feedback can provide meaningful product insights.

How to Collect Feedback

Feedback collection should be seamless, easy to ignore, and integrate into the user's workflow.

  • Seamless Integration:

    • Midjourney (Figure 18): Generates 4 images and offers clear options:

      • U1-U4 (Upscale): Strong positive signal for the chosen image.

      • V1-V4 (Variations): Weaker positive signal, but still indicates potential.

      • Regenerate: Signals dissatisfaction with all options (though users might choose it just to explore).

    • Code Assistants (GitHub Copilot - Figure 19): Drafts appear in lighter colors.

      • Accept (Tab key): Positive signal for suggestion.

      • Continue Typing (Ignore): Negative signal for suggestion.

  • Challenge for Standalone Apps (e.g., ChatGPT):

    • Lack of integration into daily workflows makes high-quality feedback collection harder. ChatGPT doesn't know if a generated email was actually sent or used.
  • Context for Feedback:

    • While simple metrics (thumbs up/down) are useful for analytics, deeper analysis requires context (e.g., previous 5-10 dialogue turns).

    • Privacy Concern: Providing context often involves personally identifiable information (PII), requiring explicit user consent.

  • User Consent & Data Donation:

    • Products may include terms of service allowing data access for analytics/improvement.

    • Alternatively, users might be asked to "donate" (share) recent interaction data along with their feedback. For example, when submitting feedback, a user might be asked to check a box to share their recent data as context for that feedback.

  • Motivation: Explain to users how their feedback will be used (personalization, statistics, model training) and address their privacy concerns, for example by assuring them that their data won't be used to train models or won't leave their device (only if these claims are true).

  • Avoid Impossible Choices:

    • Figure 20 (ChatGPT statistical question): Don't ask users to choose between complex options they don't understand, especially for factual questions; an "I don't know" option would be helpful. The figure shows ChatGPT asking a user "Which response do you prefer?" for a mathematical question, where the right answer shouldn't be a matter of preference.

  • Clear Design & Avoid Ambiguity:

    • Use icons and tooltips. Avoid confusing designs.

    • Figure 21 (Luma emoji error): An angry emoji representing a 1-star rating, placed where the 5-star option would normally appear, led to misinterpretation.

  • Public vs. Private Feedback:

    • Private Feedback: Users are generally more candid, leading to higher-quality signals (e.g., X making "likes" private led to an uptick in likes).

    • Public Feedback: Can increase discoverability and explainability, but may reduce candidness due to fear of judgment (e.g., Midjourney's early public upscales). The choice impacts user behavior, experience, and feedback quality.

Active Recall Question: When is it particularly valuable to collect user feedback during the AI application's user journey? What is a design pitfall to avoid when collecting feedback, especially for technical questions?
Answer: It's particularly valuable to collect feedback: at the beginning (for calibration), when something bad happens (for error reporting/recovery), and when the model has low confidence (for comparative feedback). A design pitfall to avoid is asking users to choose between options they don't understand, especially for factual or technical questions, as this leads to noisy data and user frustration. Provide an "I don't know" option if applicable.


Feedback Limitations: Biases and Degenerate Loops

While immensely valuable, user feedback is not without its limitations and potential pitfalls. It's crucial to understand these biases to design effective systems.

Biases in User Feedback

Like any data, user feedback is susceptible to various biases:

  • 1. Leniency Bias:

    • Definition: Tendency to rate items more positively than deserved, often to avoid conflict, appear nice, or choose the path of least resistance (e.g., avoiding prompts for reasons for negative feedback).

    • Example: Uber drivers average ratings around 4.8, and scores below 4.6 are considered problematic, even though 5 stars nominally means "excellent." This suggests users often give 5 stars by default.

    • Mitigation: Reframe rating scales to be more descriptive than numerical (e.g., "Great ride" vs. "5 stars") to encourage more granular and honest feedback.

  • 2. Randomness:

    • Definition: Users providing arbitrary feedback due to lack of motivation or time, rather than thoughtful input.

    • Examples: Randomly clicking on one of two long responses in a comparative evaluation, or choosing a random image variation in Midjourney.

  • 3. Position Bias:

    • Definition: The position of an option or suggestion influencing user choice, often favoring the first option presented.

    • Implication: A click on the first suggestion doesn't necessarily mean it's the best suggestion.

    • Mitigation: Randomly vary the positions of suggestions, or use models to compute a "true" success rate adjusting for position bias.

  • 4. Preference Bias:

    • Definition: Various psychological biases affecting user preferences.

    • Examples:

      • Length Preference: Users might prefer longer responses in side-by-side comparisons, even if they are less accurate (length is easier to perceive than accuracy).

      • Recency Bias: Favoring the last seen answer when comparing multiple options.

    • Importance: Inspecting feedback to uncover these biases is critical for accurate interpretation and avoiding misleading product decisions.

Degenerate Feedback Loops

  • Concept: A critical issue where a system's predictions or outputs influence the user feedback, which in turn reinforces and amplifies initial biases in subsequent model iterations. This creates a self-perpetuating cycle.

  • Mechanism: User feedback is inherently incomplete; the system only gets feedback on what it shows users.

  • Example (Video Recommendation - "Exposure Bias," "Popularity Bias," "Filter Bubbles"):

    • A video (A) is initially ranked slightly higher than video (B).

    • Because A is shown more, it receives more clicks/views.

    • The system interprets these clicks as positive feedback, reinforcing A's ranking.

    • Over time, A's popularity soars, while B remains overlooked, not due to inherent quality, but due to the system's initial bias being amplified by feedback. This makes it hard for new/lesser-known content to gain traction.

  • Impact on Product Focus:

    • A small initial preference for "cat photos" can lead the system to generate more cat photos.

    • This attracts more "cat lovers," who provide more positive feedback on cat photos.

    • The loop amplifies, potentially turning the application into a "cat haven," but also capable of amplifying harmful biases like racism, sexism, or preference for explicit content.

  • Sycophancy in Models:

    • Studies (Stray, 2023; Sharma et al., 2023) show that models trained on human feedback can learn to prioritize giving users what they think users want (sycophancy), even if it's less accurate or beneficial. This means the model might "lie" to please the user.
  • Conclusion: User feedback is vital for improving user experience, but its indiscriminate use can perpetuate biases and fundamentally damage the product. A thorough understanding of its limitations and potential impacts is essential before incorporation.
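
A toy simulation of the degenerate feedback loop: two equally good items, but the one with a tiny initial score advantage is always shown first, collects more clicks, and pulls further ahead (all numbers are illustrative):

    import random

    random.seed(0)
    scores = {"video_A": 1.01, "video_B": 1.00}       # tiny initial advantage for A
    true_quality = {"video_A": 0.5, "video_B": 0.5}   # equally good in reality

    for _ in range(10_000):
        shown = max(scores, key=scores.get)           # show the higher-scored item
        if random.random() < true_quality[shown]:     # users click at the same rate
            scores[shown] += 0.01                     # click interpreted as positive feedback
    print(scores)  # A's score keeps growing; B never gets the exposure to catch up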

Active Recall Question: Explain the concept of a "degenerate feedback loop" in AI applications using an example. How can this phenomenon impact the product's quality or focus?
Answer: A degenerate feedback loop occurs when an AI system's outputs or predictions influence the user feedback it receives, which then reinforces the initial biases in subsequent model iterations. For example, if a video recommendation system initially slightly favors video A, it shows A more often. Users click A more because it's prominent. This feedback tells the system A is good, so it shows A even more, creating a cycle where A becomes disproportionately popular, regardless of true merit. This can narrow the product's focus (e.g., only "cat photos" if initial feedback was biased) or lead to models that prioritize sycophancy (telling users what they want to hear) over accuracy.


Summary: The Holistic View of AI Engineering

This study note shifted from individual AI engineering techniques to their integration into a holistic process for building successful foundation model applications.

  • AI Application Architecture (Part 1):

    • Presented a common, iterative architecture, starting simple and progressively adding components (context, guardrails, routing, gateway, caching, agents).

    • Emphasized the challenges at each step and corresponding solutions.

    • Modularity vs. Fluidity: While components are separated for modularity and maintainability, their functionalities can overlap (e.g., guardrails in inference service, gateway, or standalone).

    • Complexity Trade-off: Each added component increases capability but also complexity, introducing new failure modes.

  • Monitoring and Observability:

    • Highlighted as integral, not an afterthought.

    • Involves understanding failure modes, designing metrics and alerts, and ensuring systems are designed for detectability and traceability.

    • Acknowledged applicability of traditional software/ML observability practices but noted new failure modes unique to foundation models requiring specific metrics and considerations.

  • User Feedback (Part 2):

    • Discussed how conversational interfaces enable new types of user feedback.

    • Explored various forms of natural language and action-based conversational feedback.

    • Provided guidelines for effective feedback design (when and how to collect).

    • Critical Awareness: Stressed understanding feedback limitations, including various biases and the risk of degenerate feedback loops.

  • AI Engineering as a Product Function:

    • Traditionally, user feedback was a product responsibility. However, with AI, engineers are increasingly involved due to feedback's critical role as data for continuous model improvement (the "data flywheel").

    • This reinforces the idea that AI engineering is moving closer to product development, driven by the importance of data and user experience as competitive advantages.

  • System-Level Thinking:

    • Many AI challenges are fundamentally system problems.

    • Solving them often requires a holistic view, where problems might be addressed by independent components or require collaboration across multiple parts of the system.

    • A thorough understanding of the entire system is essential for problem-solving, innovation, and ensuring safety.


Written by

Gasym A. Valiyev

Intelligent Systems Engineer specializing in Robotics & AI. Expertise in ML/DL, Python/C++, LLMs, RAG, and end-to-end intelligent systems development. Passionate about solving complex problems with AI.