The AI Engineer's Guide to Inference Optimization: Making Models Faster & Cheaper


Welcome to a deep dive into one of the most critical and fascinating areas of AI Engineering: Inference Optimization. While building powerful models is one part of the equation, making them run efficiently—faster, cheaper, and at scale—is what makes them viable in the real world. Whether you're building your own inference service or evaluating a third-party API, understanding these principles is non-negotiable.
This guide breaks down the core concepts, metrics, and techniques that turn a slow, expensive model into a production-ready powerhouse.
Note: This article is my personal rewritten version of concepts I’ve studied from Chip Huyen’s book, “AI Engineering.” It is not an official summary or reproduction — rather, a learning exercise to reinforce and share what I’ve learned.
Key Concepts
Before we dive in, let's anchor ourselves with the central ideas of this article:
Inference vs. Training: The two phases of a model's lifecycle. We focus on inference—using a trained model for predictions.
Inference Overview: In production, the component that runs model inference is called an inference server. It hosts the available models and has access to the necessary hardware. Based on requests from applications (e.g., user prompts), it allocates resources to execute the appropriate models and returns the responses to users. An inference server is part of a broader inference service, which is also responsible for receiving, routing, and possibly preprocessing requests before they reach the inference server. Model APIs like those provided by OpenAI and Google are inference services. If you use one of these services, you won’t be implementing most of the techniques discussed in this article. However, if you host a model yourself, you’ll be responsible for building, optimizing, and maintaining its inference service. A visualization of a simple inference service is shown in Figure 1.
Computational Bottlenecks: The two primary constraints on performance:
Compute-Bound: Limited by the number of calculations (FLOP/s) a processor can perform - is determined by the computation needed for the tasks to complete. For example, password decryption is typically compute-bound due to the intensive mathematical calculations required to break encryption algorithms.
Memory Bandwidth-Bound: Limited by the speed at which data can be moved between memory and the processor - the data transfer rate within the system. For example, if you store your data in the CPU memory and train a model on GPUs, you have to move data from the CPU to the GPU, which can take a long time. If you’re constrained by GPU memory and cannot fit an entire model into the GPU, you can split the model across GPU memory and CPU memory. This splitting will slow down your computation because of the time it takes to transfer data between the CPU and GPU. However, if data transfer is fast enough, this becomes less of an issue. Therefore, the memory capacity limitation is actually more about memory bandwidth.
LLM Inference Phases:
Prefill (Compute-Bound): Processing the input prompt in parallel. How many tokens can be processed at once is limited by the number of operations your hardware can execute in a given time. Therefore, prefilling is compute-bound.
Decode (Memory-Bound): Generating output tokens one by one (autoregressively; one output token at a time). At a high level, this step typically involves loading large matrices (e.g., model weights) into GPUs, which is limited by how quickly your hardware can load data into memory. Decoding is, therefore, memory bandwidth-bound.
Because prefill and decode have different computational profiles, they are often decoupled in production with separate machines.
Today, due to the prevalence of the transformer architecture and the limitations of existing accelerator technologies, many AI and data workloads are memory bandwidth-bound. However, future software and hardware advances may shift these workloads toward being compute-bound.
Online and batch inference APIs
Online APIs optimize for latency. Requests are processed as soon as they arrive. Customer-facing use cases, such as chatbots and code generation, typically require lower latency, and, therefore, tend to use online APIs.
Batch APIs optimize for cost. If your application doesn’t have strict latency requirements, you can send requests to batch APIs for more efficient processing. Higher latency allows a broader range of optimization techniques, including batching requests together and using cheaper hardware. For example, as of this writing, both Google Gemini and OpenAI offer batch APIs at a 50% cost reduction and significantly higher turnaround time, i.e., in the order of hours instead of seconds or minutes. Use cases with less stringent latency requirements, which are ideal for batch APIs, include the following:
Synthetic data generation;
Periodic reporting, such as summarizing Slack messages, sentiment analysis of brand mentions on social media, and analyzing customer support tickets;
Onboarding new customers who require processing of all their uploaded documents;
Migrating to a new model that requires reprocessing of all the data;
Generating personalized recommendations or newsletters for a large customer base;
Knowledge base updates by reindexing an organization’s data.
APIs usually return complete responses by default. However, with autoregressive decoding, it can take a long time for a model to complete a response, and users are impatient. Many online APIs offer streaming mode, which returns each token as it’s generated. This reduces the time the users have to wait until the first token. The downside of this approach is that you can’t score a response before showing it to users, increasing the risk of users seeing bad responses. However, you can still retrospectively update or remove a response as soon as the risk is detected.
WARNING: A batch API for foundation models differs from batch inference for traditional ML. In traditional ML:
Online inference means that predictions are computed after requests have arrived.
Batch inference means that predictions are precomputed before requests have arrived.
Pre-computation is possible for use cases with finite and predictable inputs like recommendation systems, where recommendations can be generated for all users in advance. These precomputed predictions are fetched when requests arrive, e.g., when a user visits the website. However, with foundation model use cases where the inputs are open-ended, it’s hard to predict all user prompts.
Core Performance Metrics:
Latency: Time to First Token (TTFT) and Time Per Output Token (TPOT).
Throughput: Tokens per second (TPS), a proxy for cost efficiency.
Utilization: Model FLOP/s Utilization (MFU) and Model Bandwidth Utilization (MBU), which measure true hardware efficiency.
Key Optimization Strategies:
Model-Level: Quantization, Pruning, Speculative Decoding, Attention Optimization (KV Cache).
Service-Level: Batching (Static, Dynamic, Continuous), Parallelism (Tensor, Pipeline), Prompt Caching.
Definitions & Equations
1. Computational Bottlenecks
Inference Server: The software component that runs model inference, hosting models, and managing hardware resources to serve requests.
Compute-Bound: A task whose completion time is limited by the processor's calculation speed (FLOP/s).
Memory Bandwidth-Bound (or Memory-Bound): A task whose completion time is limited by the rate of data transfer between memory (e.g., HBM) and the processor cores.
Arithmetic Intensity: The ratio of arithmetic operations to memory access operations for a given task. It's the key factor in determining if a task is compute- or memory-bound.
Formula:
Arithmetic Intensity = FLOPs / Byte
Low Intensity: Memory-bound (many memory accesses per calculation).
High Intensity: Compute-bound (many calculations per memory access).
Figure 2: The roofline chart visualizes whether an operation is compute-bound or memory-bound. The "roofline" is formed by the hardware's peak memory bandwidth (the slanted part) and its peak FLOP/s (the flat part). A workload is limited by whichever part of the roofline it falls under: low arithmetic intensity places it under the slanted, memory-bound region; high intensity places it under the flat, compute-bound region.
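To make the roofline idea concrete, here is a minimal sketch in plain Python. The peak FLOP/s and bandwidth numbers are illustrative assumptions (roughly A100-class at FP16), not exact vendor specs; the point is the comparison against the "ridge point" where the roofline flattens.

```python
# Roofline sketch: classify a workload as compute- or memory bandwidth-bound.
# Hardware numbers below are illustrative assumptions, not exact specs.

PEAK_FLOPS = 312e12        # assumed peak compute, FLOP/s
PEAK_BANDWIDTH = 2.0e12    # assumed peak memory bandwidth, bytes/s

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity = FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def classify(flops: float, bytes_moved: float) -> str:
    intensity = arithmetic_intensity(flops, bytes_moved)
    ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity where the roofline flattens
    return "compute-bound" if intensity >= ridge_point else "memory bandwidth-bound"

# Example: a hypothetical op doing 1e12 FLOPs while moving 4e10 bytes
print(classify(flops=1e12, bytes_moved=4e10))  # low intensity -> memory bandwidth-bound
```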
2. LLM Inference Metrics
Time to First Token (TTFT): The time from when a user sends a query until the first output token is generated. This corresponds to the prefill step and depends on the input’s length.
User Expectations: Users might have different expectations for TTFT for different applications.
Chatbots/Real-time Interaction: Users expect near-instantaneous TTFT, making the interaction feel fluid and responsive.
Batch Tasks (e.g., Document Summarization): Users may tolerate a longer TTFT, as the primary goal is the complete output, not immediate feedback.
Impact of Intermediate Steps:
Scenario: Consider queries involving Chain-of-Thought (CoT) or Agentic Queries. The model might perform internal "planning" or "action" steps before generating the first token the user sees.
Model Perspective: The model might consider the first token generated after internal planning as its "first token."
User Perspective: The user only sees the first token of the final response, which occurs after all internal steps are completed. This leads to a longer observed TTFT for the user compared to the model's internal measurement. To clarify this, some teams use the term "time to publish" to specifically denote the TTFT as perceived by the end-user.
Time Per Output Token (TPOT): The average time taken to generate each subsequent token after the first one. This corresponds to the decode step. If each token takes 100 ms, a response of 1,000 tokens will take 100 s. In streaming mode, where users read each token as it’s generated, TPOT should be faster than human reading speed but doesn’t have to be much faster. A very fast reader reads at roughly 120 ms per token, so a TPOT of around 120 ms (about 6–8 tokens/second) is sufficient for most use cases.
Time Between Tokens (TBT) / Inter-Token Latency (ITL): These are essentially synonyms for TPOT, emphasizing the time gap between consecutively generated tokens.
Total Latency:
TTFT + (TPOT × Number of Output Tokens)
User Experience Nuance: Two applications with the same total latency can offer different user experiences with different TTFT and TPOT. Would your users prefer instant first tokens with a longer wait between tokens, or would they rather wait slightly longer for the first tokens but enjoy faster token generation afterward? User studies will be necessary to determine the optimal user experience. Reducing TTFT at the cost of higher TPOT is possible by shifting more compute instances from decoding to prefilling and vice versa.
The Importance of Percentiles for Latency:
Why Averages Can Be Misleading: Simply looking at the average TTFT or TPOT can be highly deceptive. A single, extremely slow request (an outlier) due to a transient network issue, a very long input prompt, or a rare internal model state can inflate the average, making the service appear much slower than it is for most users.
- Example: A set of TTFTs: [100, 102, 100, 100, 99, 104, 110, 90, 3000, 95] ms. The average is 390 ms, suggesting a very slow service. However, 9 out of 10 requests were under 110 ms.
Percentiles as a Better Indicator: Latency is best understood as a distribution. Percentiles help analyze this distribution:
p50 (Median): The 50th percentile. Half of the requests are faster than this value, and half are slower. It provides a good sense of typical performance.
p90, p95, p99: These percentiles indicate the latency for the slowest 10%, 5%, and 1% of requests, respectively. They are crucial for identifying and addressing outliers and ensuring a good experience for the vast majority of users.
Visualizing with Input Length: Plotting TTFT values against input lengths can reveal if longer inputs are disproportionately increasing latency, guiding optimization efforts towards the prefill stage.
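To see how the outlier distorts the average, the short sketch below reproduces the TTFT example above using NumPy (assumed available); the p90 and p99 values are pulled up by the single 3,000 ms request while the median stays near 100 ms.

```python
import numpy as np

# TTFT samples (ms) from the example above: one 3,000 ms outlier skews the mean.
ttft_ms = [100, 102, 100, 100, 99, 104, 110, 90, 3000, 95]

print(f"mean: {np.mean(ttft_ms):.0f} ms")            # ~390 ms, misleading
print(f"p50:  {np.percentile(ttft_ms, 50):.0f} ms")  # ~100 ms, the typical request
print(f"p90:  {np.percentile(ttft_ms, 90):.0f} ms")  # pulled up by the outlier
print(f"p99:  {np.percentile(ttft_ms, 99):.0f} ms")  # dominated by the outlier
```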
Throughput: This metric quantifies the overall rate at which output tokens are produced by the system (inference service) across all users and requests - the number of output tokens per second.
Definition: Measured in tokens per second (TPS), it reflects the system's capacity to generate output.
Input vs. Output Throughput: In modern inference servers, especially those decoupling prefill and decode, it's important to distinguish between input throughput (tokens/sec (TPS) for prefilling) and output throughput (tokens/sec (TPS) for decoding). When "throughput" is mentioned without a modifier, it typically refers to output tokens per second (TPS).
Scaling: To understand how the system scales with user load, throughput can also be expressed as tokens/second/user.
Alternative Measures: For foundation models with potentially long response times, Requests Per Minute (RPM) or Completed Requests Per Minute (CRPM) are sometimes used instead of Requests Per Second (RPS). Tracking this metric is useful for understanding how an inference service handles concurrent requests; some providers may throttle your service if you send too many concurrent requests at once.
Cost Linkage: Throughput is directly tied to cost. Higher throughput generally leads to a lower cost per token or per request.
Example: If your system costs $2/h in compute and its throughput is 100 tokens/s, it costs around $5.556 per 1M output tokens. If each request generates 200 output tokens on average, the cost for decoding 1K requests would be $1.11. The prefill cost can be similarly calculated. If your hardware costs $2 per hour and it can prefill 100 requests per minute, the cost for prefilling 1K requests would be $0.33. The total cost per request is the sum of the prefilling and decoding costs. In this example, the total cost for 1K requests would be $1.11 + $0.33 = $1.44.
Cost Calculation Breakdown (Throughput, Prefill, and Decoding): The core idea is to translate raw performance metrics (throughput, requests per minute) into cost per unit of work (per token, per request).
Decoding Cost Calculation
Given Information:
System Compute Cost: $2 per hour
Decoding Throughput: 100 output tokens per second (TPS)
Average Tokens per Request: 200 tokens
Target: Cost per 1 Million (1M) output tokens
Step 1: Calculate Cost per Second of Compute
Your system costs $2 per hour.
There are 3600 seconds in an hour.
Cost per second = $2 / 3600 seconds ≈ $0.0005556 per second.
Step 2: Determine How Many Tokens Are Processed per Second of Compute
- Your system's decoding throughput is 100 tokens/second. This means for every second your system is running and decoding, it produces 100 tokens.
Step 3: Calculate the Cost to Produce 1 Token
Cost per token = (Cost per second of compute) / (Tokens produced per second)
Cost per token = $0.0005556 / 100 tokens ≈ $0.000005556 per token.
Step 4: Calculate the Cost per 1 Million Output Tokens
Cost per 1M tokens = (Cost per token) × 1,000,000 tokens
Cost per 1M tokens = $0.000005556 × 1,000,000 ≈ $5.556 per 1M output tokens.
Step 5: Calculate the Cost for 1K Requests (Decoding Only)
First, find the total number of output tokens for 1,000 requests.
Total output tokens = (Number of requests) × (Average tokens per request)
Total output tokens = 1,000 requests × 200 tokens/request = 200,000 tokens.
Now, calculate the cost for these tokens using the cost per token:
Cost for 1K requests (decoding) = (Cost per token) × (Total output tokens)
Cost for 1K requests (decoding) = $0.000005556 × 200,000 ≈ $1.11.
Prefill Cost Calculation
Given Information:
System Compute Cost: $2 per hour
Prefill Rate: 100 requests per minute
Step 1: Convert System Cost to Per-Minute
System cost per hour: $2
There are 60 minutes in an hour.
Cost per minute = $2 / 60 minutes ≈ $0.03333 per minute.
Step 2: Calculate the Cost to Prefill 1 Request
The system can prefill 100 requests per minute.
Cost per prefill = (Cost per minute of compute) / (Requests prefilled per minute)
Cost per prefill = $0.03333 / 100 requests ≈ $0.0003333 per request.
Step 3: Calculate the Cost for 1K Requests (Prefill Only)
Cost for 1K requests (prefill) = (Cost per prefill) × 1,000 requests
Cost for 1K requests (prefill) = $0.0003333 × 1,000 ≈ $0.33.
Total Cost per Request
Given Information:
Decoding cost per 1K requests: $1.11
Prefill cost per 1K requests: $0.33
Calculation:
Total cost per request = (Total cost for 1K requests) / 1000 requests
Total cost per request = ($1.11 + $0.33) / 1000
Total cost per request = $1.44 / 1,000 = $0.00144 per request.
(In other words, $1.44 is the combined cost for 1,000 full requests: $1.11 for decoding them and $0.33 for prefilling them. Since each request involves both a prefill and a decode, the cost per request is about $0.00144.)
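The same arithmetic, collapsed into a short script. All numbers come directly from the worked example above ($2/hour of compute, 100 output tokens/s decode throughput, 100 prefills/minute, 200 output tokens per request on average).

```python
# Cost sketch using the figures from the worked example above.
COMPUTE_COST_PER_HOUR = 2.00
DECODE_TPS = 100            # output tokens per second
PREFILL_RPM = 100           # requests prefilled per minute
AVG_OUTPUT_TOKENS = 200

cost_per_second = COMPUTE_COST_PER_HOUR / 3600
cost_per_token = cost_per_second / DECODE_TPS
cost_per_1m_tokens = cost_per_token * 1_000_000                          # ~ $5.56

decode_cost_1k_requests = cost_per_token * AVG_OUTPUT_TOKENS * 1_000     # ~ $1.11

cost_per_minute = COMPUTE_COST_PER_HOUR / 60
prefill_cost_per_request = cost_per_minute / PREFILL_RPM
prefill_cost_1k_requests = prefill_cost_per_request * 1_000              # ~ $0.33

total_cost_1k_requests = decode_cost_1k_requests + prefill_cost_1k_requests
print(f"cost per 1M output tokens: ${cost_per_1m_tokens:.2f}")
print(f"total cost per request:    ${total_cost_1k_requests / 1_000:.5f}")  # ~ $0.00144
```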
Factors Influencing Throughput: Throughput is influenced by the model size, hardware capabilities (e.g., high-end chips generally yield higher throughput), and workload characteristics (e.g., consistent input/output lengths are easier to optimize).
Comparison Challenges: Direct throughput comparisons across different models, hardware, or even tokenizers can be approximate, as what constitutes a "token" can vary. Comparing cost per request is often a more robust measure of efficiency.
Latency-Throughput Trade-off: Like most software systems, inference services face a fundamental trade-off. Techniques like batching can boost throughput significantly (e.g., doubling or tripling it) but often at the cost of increased TTFT and TPOT (worse latency).
Goodput: A more user-centric metric that measures the number of requests per second (RPS) that successfully meet predefined Service-Level Objectives (SLOs).
Definition: An SLO is a specific performance target, like "TTFT must be less than 200 ms" or "TPOT must be less than 100 ms."
Contrast with Throughput: While throughput measures the total output capacity, goodput measures the useful capacity—how many requests are actually satisfying the user's performance expectations.
Example: Imagine an application with SLOs of TTFT ≤ 200 ms and TPOT ≤ 100 ms. If an inference service processes 100 requests per minute (high throughput), but only 30 of those requests meet both SLOs, then the goodput of that service is 30 requests per minute. This metric provides a direct measure of how well the service is meeting user-perceived performance requirements.
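A minimal sketch of how goodput might be computed from per-request latency measurements. The SLO thresholds match the example above; the per-request data is invented purely for illustration.

```python
# Goodput sketch: count only requests that meet both SLOs.
TTFT_SLO_MS = 200
TPOT_SLO_MS = 100

# Hypothetical per-request measurements.
requests = [
    {"ttft_ms": 150, "tpot_ms": 80},   # meets both SLOs
    {"ttft_ms": 250, "tpot_ms": 70},   # misses TTFT
    {"ttft_ms": 180, "tpot_ms": 120},  # misses TPOT
    {"ttft_ms": 190, "tpot_ms": 90},   # meets both SLOs
]

good = sum(1 for r in requests
           if r["ttft_ms"] <= TTFT_SLO_MS and r["tpot_ms"] <= TPOT_SLO_MS)

print(f"throughput: {len(requests)} requests, goodput: {good} requests")
```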
3. Hardware Utilization Metrics
Beyond latency and throughput, understanding how efficiently your hardware is being used is critical for cost-effectiveness and performance.
General Utilization: This refers to the proportion of a resource's total capacity that is actively engaged in processing tasks.
The Pitfall of Standard GPU Utilization (e.g., nvidia-smi):
What it Measures: Tools like nvidia-smi (SMI stands for System Management Interface) report "GPU Utilization" as the percentage of time the GPU is actively computing. For example, if a GPU is busy for 5 out of 10 hours, its utilization is 50%.
The Misunderstanding: This metric doesn't reflect how much work is being done relative to the hardware's potential. A tiny task can keep a powerful GPU busy, reporting 100% utilization while performing only a fraction of its peak capability.
Why It's Not Useful for Optimization: If you're paying for a machine that can do 100 operations/second but it's only doing 1 operation/second (while reporting 100% "busy time"), you are wasting money and performance.
Model FLOP/s Utilization (MFU): This is a more meaningful metric for AI workloads. It measures the actual computational efficiency relative to the hardware's theoretical peak.
Definition: MFU is the ratio of the observed computational throughput (in FLOP/s) to the theoretical maximum FLOP/s the chip can achieve.
Formula:
$$\text{MFU} = \frac{\text{Achieved FLOP/s}}{\text{Peak Theoretical FLOP/s}}$$
Example: If a chip's peak FLOP/s allows it to generate 100 tokens/s, but your inference service only achieves 20 tokens/s, your MFU is 20%. This directly tells you how much of the chip's raw computational power is being effectively utilized for your specific task.
Model Bandwidth Utilization (MBU): Similar to MFU, MBU focuses on the efficient use of memory bandwidth, which is often a critical bottleneck, especially for LLMs.
Definition: MBU measures the percentage of the hardware's peak memory bandwidth that is actually consumed by the model's operations.
Calculation for LLM Inference:
Bandwidth Used:
Parameter-Count × Bytes/Parameter × Tokens/s
MBU Formula:
$$\text{MBU} = \frac{(\text{Parameter-Count} \times \text{Bytes/Parameter} \times \text{Tokens/s})}{\text{Peak Theoretical Bandwidth}}$$
Example: A 7B parameter model in FP16 (2 bytes/parameter) running at 100 tokens/s uses approximately 7B × 2 × 100 = 1,400 GB/s of bandwidth. If the hardware is an A100-80GB GPU with a peak memory bandwidth of 2 TB/s (2,000 GB/s), the MBU is (1,400 GB/s) / (2,000 GB/s) = 70%.
Impact of Quantization: This example highlights why quantization is crucial. Using fewer bytes per parameter (e.g., INT4 instead of FP16) directly reduces the bandwidth requirement, improving MBU.
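The sketch below reproduces the 7B FP16 example as code. The peak-bandwidth figure follows the A100-80GB number used above; the peak-FLOP/s figure and the "~2 FLOPs per parameter per generated token" rule of thumb are assumptions for illustration, not measured values.

```python
# MFU / MBU sketch using the 7B-parameter FP16 example above.
PARAMS = 7e9
BYTES_PER_PARAM = 2          # FP16
TOKENS_PER_SECOND = 100

PEAK_BANDWIDTH_GBPS = 2000   # GB/s, from the A100-80GB example
PEAK_FLOPS = 312e12          # FLOP/s, assumed

# MBU: bandwidth actually consumed / peak bandwidth
bandwidth_used_gbps = PARAMS * BYTES_PER_PARAM * TOKENS_PER_SECOND / 1e9
mbu = bandwidth_used_gbps / PEAK_BANDWIDTH_GBPS
print(f"MBU: {mbu:.0%}")     # 70%

# MFU: achieved FLOP/s / peak FLOP/s, using a rough ~2 FLOPs/parameter/token rule.
achieved_flops = 2 * PARAMS * TOKENS_PER_SECOND
mfu = achieved_flops / PEAK_FLOPS
print(f"MFU: {mfu:.1%}")     # very low, as expected for memory-bound decoding
```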
Relationship between Throughput, MFU, and MBU: MFU and MBU are directly proportional to throughput (tokens/s). Higher throughput achieved with the same hardware implies higher MFU and/or MBU.
Interpreting MFU and MBU:
Workload Type:
Compute-Bound Workloads: Tend to have higher MFU (using most of the FLOP/s) and lower MBU (memory bandwidth is not the bottleneck).
Memory Bandwidth-Bound Workloads: Tend to have lower MFU (processor is waiting for data) and higher MBU (using most of the available bandwidth).
Training vs. Inference: MFU for training is often higher than MFU for inference due to more predictable workloads and optimized batching strategies. For inference, prefill is typically compute-bound (higher MFU), while decode is memory-bound (higher MBU, lower MFU).
Good Utilization Benchmarks: For model training, an MFU above 50% is generally considered good. Achieving high MFU/MBU in inference can be challenging and depends heavily on the model, hardware, and specific optimization techniques used.
Figure 3: This figure shows that as the number of concurrent users increases (leading to a higher computational load per second), the MBU for Llama 2-70B decreases. This suggests that at higher concurrency, the workload might be shifting from being primarily bandwidth-bound to becoming more compute-bound, or that the available bandwidth is being saturated and cannot be further exploited.
The Goal of Optimization: While high utilization metrics are good indicators of efficiency, the ultimate goal is to achieve faster inference (lower latency) and lower costs. Simply maximizing utilization without improving these outcomes might be counterproductive. For instance, very high utilization achieved by sacrificing TTFT/TPOT could lead to a poor user experience.
4. KV Cache Size
KV Cache: A memory store used during autoregressive decoding to cache the Key (K) and Value (V) vectors of previous tokens, avoiding redundant computations.
KV Cache Size Calculation (Unoptimized):
$$2 \times B \times S \times L \times H \times M$$
Where:
B: Batch size
S: Sequence length
L: Number of transformer layers
H: Model hidden dimension
M: Bytes per parameter (e.g., 2 for FP16)
This value can become substantial as the context length increases. For example, Llama 2 13B has 40 layers and a model dimension of 5,120. With a batch size of 32, a sequence length of 2,048, and 2 bytes per value, the memory needed for its KV cache, without any optimization, is 2 × 32 × 2,048 × 40 × 5,120 × 2 = 54 GB.
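The same calculation as a small helper, using the unoptimized formula and the Llama 2 13B numbers above:

```python
# KV cache size sketch: 2 (K and V) x batch x seq_len x layers x hidden_dim x bytes_per_value.

def kv_cache_bytes(batch, seq_len, n_layers, hidden_dim, bytes_per_value=2):
    return 2 * batch * seq_len * n_layers * hidden_dim * bytes_per_value

# Llama 2 13B example from the text: 40 layers, hidden dimension 5,120,
# batch size 32, sequence length 2,048, FP16 (2 bytes per value).
size = kv_cache_bytes(batch=32, seq_len=2048, n_layers=40, hidden_dim=5120)
print(f"{size / 1e9:.0f} GB")   # ~54 GB
```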
5. AI Accelerators
The speed and cost of running AI software are fundamentally dictated by the underlying hardware. While general optimization techniques exist, a deeper understanding of hardware enables more profound optimizations. This section focuses on AI hardware from an inference perspective, though many principles apply to training as well.
A Symbiotic History: AI and Hardware Evolution
- The development of AI models and the hardware to run them has been a tightly coupled dance. The limitations of computing power were a significant factor in the "AI winters" of the past. Conversely, breakthroughs in hardware, such as the early adoption of GPUs for deep learning with AlexNet in 2012, were instrumental in the resurgence of AI. GPUs offered a massive leap in parallel processing capability over CPUs, making large-scale neural network training accessible to researchers and sparking the deep learning revolution.
What is an AI Accelerator?
An AI Accelerator is a specialized piece of silicon designed to efficiently handle specific computational workloads associated with Artificial Intelligence.
Dominant Type: Graphics Processing Units (GPUs) are currently the most dominant type of AI accelerator, with NVIDIA being the leading economic force in this domain.
CPU vs. GPU: The Core Difference:
CPUs (Central Processing Units): Designed for general-purpose computing. They feature a few very powerful cores, optimized for high single-thread performance. CPUs excel at sequential tasks, complex logic, managing system operations (OS, I/O), and tasks that cannot be easily parallelized.
GPUs (Graphics Processing Units): Designed for highly parallel processing. They contain thousands of smaller, less powerful cores optimized for tasks that can be broken down into many identical, independent calculations. The quintessential example is matrix multiplication, which is fundamental to most machine learning operations and is inherently parallelizable.
Challenges of Parallelism: While GPUs offer immense computational power, their parallel nature introduces challenges in memory design (moving data efficiently to thousands of cores) and power consumption.
The Expanding Landscape of Accelerators
The success of GPUs has spurred the development of a diverse array of AI accelerators, including:
AMD GPUs: High-performance alternatives to NVIDIA GPUs.
Google TPUs (Tensor Processing Units): Custom-designed for neural network workloads, with a strong focus on tensor operations.
Intel Habana Gaudi: Designed for deep learning training and inference.
Graphcore IPUs (Intelligence Processing Units): Feature a unique architecture with a large amount of on-chip memory and a focus on graph-like computations.
Groq LPUs (Language Processing Units): Specialized for accelerating large language models.
Cerebras Wafer-Scale Engines (WSE): Massive chips designed for extreme parallelism.
Specialization for Inference
A significant trend is the emergence of chips specifically optimized for inference.
Inference Cost Dominance: Studies indicate that inference can often exceed training costs in deployed AI systems, accounting for up to 90% of total ML expenditure.
Inference Optimization Focus:
Lower Precision: Inference often benefits greatly from lower numerical precision (e.g., INT8, FP8) which reduces memory footprint and speeds up computation.
Memory Access: Faster memory access is critical for quickly loading model weights.
Latency Minimization: Unlike training which prioritizes throughput, inference often aims to minimize latency.
Examples of Inference-Specific Chips: Apple Neural Engine, AWS Inferentia, MTIA (Meta Training and Inference Accelerator).
Edge Computing Accelerators: Chips designed for devices with limited power and computational resources, such as Google's Edge TPU and NVIDIA Jetson Xavier series.
Architecture-Specific Accelerators: Some chips are tailored for specific model architectures, like transformers.
Hardware Architectures and Compute Primitives
Different hardware architectures feature distinct memory layouts and specialized compute units optimized for various data types:
Compute Primitives: These are the basic operations a chip's hardware is designed to perform efficiently. Common primitives include:
Scalar Operations: Processing single data points.
Vector Operations: Processing arrays of data.
Tensor Operations: Processing multi-dimensional arrays (matrices and higher-order tensors), crucial for neural networks.
Figure 4: Illustrates different compute primitives. While traditional CPUs excel at scalar operations, GPUs have strong vector capabilities, and specialized AI accelerators (like TPUs) are built around tensor operations.
Chip Design: A chip might combine these units. GPUs traditionally supported vector operations, but many modern GPUs, for example, have evolved to include "Tensor Cores" specifically optimized for matrix and tensor computations. TPUs, conversely, are designed with tensor operations as their primary focus. To maximize efficiency, a model's operations need to align with the chip's strengths. A chip’s specifications contain many details that can be useful when evaluating this chip for each specific use case.
Key Evaluation Characteristics of Accelerators
When evaluating hardware for AI workloads, several characteristics are paramount:
Computational Capabilities:
Metric: Measured in FLOP/s (Floating-Point Operations Per Second), often expressed in teraFLOPs (TFLOPS) or petaFLOPs (PFLOPS).
Precision Dependence: Higher numerical precision (e.g., FP32 vs. FP16 vs. FP8) requires more computation per operation, leading to fewer operations per second.
Theoretical Peak vs. Actual: The advertised FLOP/s is a theoretical maximum. Actual performance (MFU) depends on how efficiently the workload can utilize the hardware.
Table Example (NVIDIA H100 SXM): Demonstrates how FLOP/s scales with precision, with FP8 offering the highest theoretical throughput when sparsity is utilized.
Memory Size and Bandwidth:
Importance: With thousands of parallel cores, efficient data movement is critical. Large AI models and datasets require fast access to memory to keep these cores busy.
GPU Memory Technologies: GPUs typically use High-Bandwidth Memory (HBM), a 3D stacked memory technology, which offers significantly higher bandwidth and lower latency compared to the DDR SDRAM used in CPUs (which has a 2D structure). This is a key reason for higher GPU memory costs.
Memory Hierarchy: An accelerator’s memory is measured by its size and bandwidth. These numbers need to be evaluated within the system an accelerator is part of. Accelerators interact with multiple memory tiers, each with different speeds and capacities (as visualized in Figure 7):
CPU DRAM (System Memory): Lowest bandwidth (25-50 GB/s), largest capacity (1TB+ possible). Used as a fallback.
GPU HBM: High bandwidth (256 GB/s to 1.5 TB/s+), moderate capacity (24-80 GB typical for consumer/prosumer GPUs). This is where model weights and activations are primarily stored.
GPU On-Chip SRAM (Caches): Extremely high bandwidth (10 TB/s+), very small capacity (tens of MB). Used for immediate access to frequently used data.
Framework Limitations: A lot of GPU optimization is about how to make the most out of this memory hierarchy. However, current popular frameworks (PyTorch, TensorFlow) offer limited direct control over it, prompting interest in lower-level programming languages like CUDA (Compute Unified Device Architecture), Triton, and ROCm (Radeon Open Compute).
Power Consumption:
Transistor Switching: Chips rely on transistors to perform computation. Each computation is done by transistors switching on and off, which requires energy. A GPU can have billions of transistors—an NVIDIA A100 has 54 billion transistors, while an NVIDIA H100 has 80 billion. When an accelerator is used efficiently, billions of transistors rapidly switch states, consuming a substantial amount of energy and generating a nontrivial amount of heat. This heat requires cooling systems, which also consume electricity, adding to data centers’ overall energy consumption.
Environmental Impact: The massive energy consumption of data centers powering AI is a growing concern, driving demand for energy-efficient hardware and "green data center" technologies.
Metrics:
Maximum Power Draw: The absolute peak power a chip can consume.
TDP (Thermal Design Power): A proxy for power consumption, representing the heat a cooling system must dissipate under typical workloads. For CPUs and GPUs, the actual power draw can be 1.1 to 1.5 times the TDP.
Cloud vs. On-Prem: Cloud users are insulated from direct cooling/electricity costs but should still consider the environmental impact.
Selecting the Right Accelerators
The choice of accelerator hinges on the specific workload:
Compute-Bound Workloads: Prioritize chips with higher FLOP/s.
Memory-Bound Workloads: Focus on chips with higher memory bandwidth and larger memory capacity.
When making a selection, consider these core questions:
Can it run the workload? (Does it have enough compute and memory?)
How long will it take? (What are the expected latency and throughput?)
How much does it cost? (Initial purchase, ongoing power, or cloud rental fees.)
FLOP/s, memory size, and bandwidth are key to answering the first two questions, while cost is usually more straightforward, though it includes power and cooling for on-premise solutions.
Intuitive Explanations
Compute-Bound vs. Memory-Bound: The Master Chef Analogy
Imagine a master chef (the Processor/GPU Core) in a massive kitchen.
Compute-Bound Task (e.g., Complex Sauce Reduction): The chef is furiously chopping, mixing, and tasting. The recipe is complex and requires immense skill and speed. The limiting factor is the chef's own speed. Kitchen assistants bringing ingredients are waiting on the chef. This is like the prefill phase of an LLM, where massive parallel matrix multiplications max out the GPU's computational power.
Memory-Bound Task (e.g., Assembling a Simple Salad): The recipe is simple: grab lettuce, tomatoes, and dressing. The chef can assemble it instantly but has to wait for an assistant to run to the pantry (HBM Memory) and back for each ingredient. The limiting factor is the assistant's speed (the memory bandwidth). The chef is mostly idle, waiting for data. This is like the decode phase of an LLM, where for each new token, the huge model weights must be read from memory.
Figure 5: The initial processing of the prompt ("Prefill") is a parallel, compute-intensive task. The subsequent generation of each token ("Decode") is a sequential, memory-intensive task.
TTFT vs. TPOT: The User Experience of Waiting
TTFT (Time to First Token) is like asking a question and waiting for the first word of the answer. A low TTFT makes an application feel responsive and "alive." For a chatbot, this is crucial.
TPOT (Time Per Output Token) is the speed at which the rest of the answer is typed out. As long as it's faster than human reading speed (around 6-8 tokens/sec), the experience feels smooth. A very fast TPOT might not be noticeable, but a slow one is frustrating.
Figure 6: This illustrates "goodput." Even if a system processes many requests, only those meeting latency SLOs (e.g., TTFT < 200ms, TPOT < 100ms) count towards goodput. The dark green bars represent requests that failed the SLO. If an inference service can complete 10 RPS but only 3 satisfy the SLO, then its goodput is 3 RPS.
Hardware Memory Hierarchy: The Researcher's Desk
Think of a researcher working on a project. Their access to information has different speeds and capacities.
GPU SRAM (On-Chip Cache): This is the researcher's own brain and the sticky notes right in front of them. Blazingly fast access (10+ TB/s) but very limited space (tens of MB).
GPU HBM (High-Bandwidth Memory): These are the books and papers laid out on their desk. Fast to grab (1.5 TB/s) and holds a decent amount (40-80 GB). This is where the model weights live.
CPU DRAM (System Memory): This is the library down the hall. Huge capacity (up to 1 TB+) but slow to access (25 GB/s). You only go there when you absolutely have to.
Figure 7: The memory hierarchy shows a trade-off: the fastest memory (SRAM) has the smallest capacity, while the largest memory (DRAM) is the slowest. Effective optimization is about keeping the most needed data in the fastest possible tier.
Deep Dive into Inference Optimization Techniques
Introduction
Inference optimization can be approached from three main angles: model-level, hardware-level, and service-level.
To picture the difference, imagine archery:
Model-level optimization → crafting better arrows.
Hardware-level optimization → training a stronger and more skilled archer.
Service-level optimization → refining the entire shooting process, including the bow, aiming, and conditions.
The goal of optimization—especially for speed and cost—is ideally to preserve the model’s quality. However, many techniques can unintentionally degrade performance.
Figure 8 below (adapted from Cerebras, 2024) illustrates how the same Llama 3.1 models perform on various benchmarks when served by different inference providers. Even though the models are identical, differences in provider-level optimization techniques can lead to slight variations in accuracy across tasks.
Since hardware design is outside the scope of this study, the focus here will be on model-level and service-level techniques. In real-world deployments, optimization often involves a blend of methods from multiple levels.
1. Model-Level Optimization
This is like making the arrow itself more aerodynamic. Model-level optimization aims to make a model more efficient, often by modifying the model itself, which can alter its behavior.
Many current foundation models—especially those based on the transformer architecture—include an autoregressive language model component. These models share three characteristics that make inference particularly resource-intensive:
Model size – Large parameter counts demand significant memory and computational power.
Autoregressive decoding – Tokens are generated one at a time, which slows down output generation.
Attention mechanism – Computing attention over long sequences grows more expensive as input length increases.
Model Compression
Model compression refers to techniques that reduce a model’s size, often making it faster and more efficient. This includes three main approaches: quantization, distillation, and pruning.
Quantization – Reduces the precision of a model’s weights (e.g., from 32-bit to 16-bit), cutting the memory footprint and increasing throughput. A drop from 32 bits to 16 bits halves memory use, making this the most popular compression method because it is easy to apply, widely supported, and highly effective. However, quantization has a hard lower bound of 1 bit per value.
Distillation – Trains a smaller “student” model to mimic the behavior of a larger “teacher” model, often producing a compact model that performs comparably to the original while being more efficient.
Pruning – Removes unnecessary parts of the network:
Structural pruning – Deletes entire nodes or layers, changing the architecture and reducing parameters.
Weight pruning – Sets the least useful parameters to zero, creating a sparse model without reducing the total parameter count. This reduces storage needs and can speed up computation—if the hardware supports sparse operations.
Pruned models can be used directly or finetuned to recover accuracy. They can also inspire smaller architectures to train from scratch. Research (e.g., Frankle & Carbin, 2019) shows pruning can reduce non-zero parameters by over 90% without hurting accuracy. Still, pruning is less common in practice because it is harder to apply, depends on deep knowledge of the model’s architecture, and often yields smaller performance gains than other methods.
Among these, weight-only quantization dominates in adoption due to its ease, broad compatibility, and strong benefits (reducing a model’s precision from 32 bits to 16 bits halves its memory footprint). However, quantization has a hard floor: we can’t go lower than 1 bit per value. Distillation remains common when a smaller but still capable model is needed.
Overcoming Autoregressive Decoding Bottleneck: This addresses the "one token at a time" bottleneck. Autoregressive language models generate tokens sequentially, which means that producing long outputs can be slow and costly. For example, if it takes 100 ms to generate one token, a response of 100 tokens will take 10 s. This is particularly expensive because output tokens typically cost 2–4× more than input tokens in API usage, and in some cases, a single output token can have the same latency impact as 100 input tokens. Even small improvements to the decoding process can significantly enhance user experience. Although the field is evolving rapidly, several promising techniques aim to accelerate token generation while preserving quality:
Speculative Decoding (also called speculative sampling): Uses a faster, weaker draft model to propose multiple tokens ahead of time, which the target model verifies in parallel.
Process:
Draft model generates K tokens.
Target model verifies these tokens.
Accept the longest verified subsequence and generate one new token.
Repeat.
The process is visualized in Figure 9:
If no draft token is accepted, this loop produces only one token, generated by the target model. If all draft tokens are accepted, it produces K + 1 tokens: K generated by the draft model and one by the target model.
If all draft sequences are rejected, the target model must generate the entire response in addition to verifying it, potentially leading to increased latency. However, this can be avoided because of three insights:
The time it takes for the target model to verify a sequence of tokens is less than the time it takes to generate it, because verification is parallelizable, while generation is sequential. Speculative decoding effectively turns the computation profile of decoding into that of prefilling.
In an output token sequence, some tokens are easier to predict than others. It’s possible to find a weaker draft model capable of getting these easier-to-predict tokens right, leading to a high acceptance rate of the draft tokens.
Decoding is memory bandwidth-bound, which means that during the decoding process, there are typically idle FLOPs that can be used for free verification.
Domain impact: Acceptance rates are domain-dependent. Works best for structured outputs (e.g., code) where the draft model can achieve a high acceptance rate.
Example: The draft model can be of any architecture, though ideally it should share the same vocabulary and tokenizer as the target model. We can train a custom draft model or use an existing weaker model. For example, DeepMind sped up Chinchilla-70B by training a 4B draft model of the same architecture, achieving a >2× speedup (reduced response latency) with no loss in quality. The draft model can generate a token 8 times faster than the target model (1.8 ms/token compared to 14.1 ms/token). A similar speedup was achieved for T5-XXL (Leviathan et al., 2022).
This approach has gained traction because it’s relatively easy to implement and doesn’t change a model’s quality. For example, it’s possible to do so in 50 lines of code in PyTorch. It’s been incorporated into popular inference frameworks such as vLLM, TensorRT-LLM, and llama.cpp.
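Below is a minimal, greedy sketch of the propose-then-verify loop. In a real implementation the target model scores all K draft positions in a single parallel forward pass and uses probabilistic acceptance; here, the toy draft_next/target_next callables and the position-by-position comparison are assumptions for illustration only.

```python
# Speculative decoding sketch with toy, deterministic "models".
from typing import Callable, List

def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       k: int = 4,
                       max_new_tokens: int = 16) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k tokens sequentially (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. Target model "verifies" the k positions. In a real system this is
        #    one parallel forward pass; here we just compare greedy predictions.
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if expected == draft[i]:
                accepted.append(draft[i])
            else:
                accepted.append(expected)   # target's token replaces the rejected draft
                break
        else:
            # 3. All k drafts accepted: target contributes one bonus token.
            accepted.append(target_next(tokens + draft))

        tokens.extend(accepted)
    return tokens

# Toy models over an integer vocabulary: the draft agrees with the target most of the time.
target = lambda seq: (seq[-1] + 1) % 100
draft = lambda seq: (seq[-1] + 1) % 100 if seq[-1] % 7 else 0  # occasionally wrong

print(speculative_decode([1, 2, 3], draft, target))
```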
Inference with Reference: When outputs repeat text from inputs (e.g., quoting a document, reusing most of a code snippet), the model can copy tokens directly instead of generating them.
Advantages: No extra model needed. Inference with reference is similar to speculative decoding, but instead of using a model to generate draft tokens, it selects draft tokens from the input. The key challenge is to develop an algorithm to identify the most relevant text span from the context at each decoding step. The simplest option is to find a text span that matches the current tokens.
Best use cases: Retrieval-augmented generation, code editing, multi-turn conversation. In “Inference with Reference: Lossless Acceleration of Large Language Models” (Yang et al., 2023), this technique helps achieve two times generation speedup in such use cases.
Performance: Can yield ~2× speedup in repetition-heavy scenarios. Examples of how inference with reference works are shown in Figure 10 below (2 examples of inference with reference. The text spans that are successfully copied from the input are in red and green. Image from Yang et al. (2023). The image is licensed under CC BY 4.0.).
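A toy sketch of the draft-selection step described above, shown at the word level for readability: it finds the most recent span in the input that matches the current output suffix and proposes its continuation as the draft. A real system operates on token IDs and still verifies the copied tokens with the target model.

```python
# Inference-with-reference sketch: pick draft tokens by copying from the input context.
from typing import List, Optional

def draft_from_context(context: List[str], generated: List[str],
                       match_len: int = 2, k: int = 4) -> Optional[List[str]]:
    """Return up to k draft tokens copied from `context`, or None if no match."""
    if len(generated) < match_len:
        return None
    suffix = generated[-match_len:]
    # Scan the context for the most recent occurrence of the suffix.
    for start in range(len(context) - match_len, -1, -1):
        if context[start:start + match_len] == suffix:
            return context[start + match_len: start + match_len + k]
    return None

context = "the quick brown fox jumps over the lazy dog".split()
generated = "he said : the quick".split()
print(draft_from_context(context, generated))  # ['brown', 'fox', 'jumps', 'over']
```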
Parallel Decoding: Using multiple "prediction heads" to guess several future tokens at once, then verifying and integrating them. Breaks the sequential dependency by generating future tokens simultaneously.
Approaches:
Lookahead decoding: Uses the same decoder to predict multiple tokens ahead, verifying them iteratively.
Medusa: Adds extra decoding “heads” on top of the original model, which is kept frozen while the heads are trained. Each head is a small neural network layer trained to predict a token at a different future position; the predictions are verified via a tree-based attention search.
Challenges: Complex to implement; because these tokens aren’t generated sequentially, additional verification steps are needed to ensure they fit together. Lookahead decoding uses the Jacobi method to verify the generated tokens (Jacobi decoding). Medusa, on the other hand, uses a tree-based attention mechanism to verify and integrate tokens: each head produces several options for its position, and these options are organized into a tree-like structure to select the most promising combination. The process is visualized in Figure 11.
Impact: NVIDIA reported up to 1.9× speedup for Llama 3.1 using Medusa on HGX H200 GPUs.
Figure 11: Medusa uses extra "heads" to predict multiple future tokens in parallel. These predictions form a tree of possibilities, and the best path is chosen in a single step, accelerating generation. Each head predicts several options for a token position. The most promising sequence from these options is selected. Image adapted from the paper, which is licensed under CC BY 4.0.
Attention Mechanism Optimization: The attention mechanism's cost grows quadratically with sequence length because generating each next token requires the key and value vectors of all previous tokens. The KV cache, which stores those key and value vectors for reuse and grows linearly with sequence length, is the mechanism's biggest memory hog.
KV Cache: A brilliant hack to avoid re-calculating attention for past tokens. But for long contexts, it can become larger than the model itself!
NOTE: A KV cache is used only during inference, not training. During training, because all tokens in a sequence are known in advance, next token generation can be computed all at once instead of sequentially, as during inference. Therefore, there’s no need for a KV cache.
KV cache size increases proportionally with larger batch sizes. Example calculation (Google paper, Pope et al., 2022):
Model: 500B+ parameters with multi-head attention
Batch size: 512
Context length: 2048 tokens
Result: KV cache = 3TB
Scale: 3x larger than the model weights themselves
Key Insight: KV cache can become the dominant memory bottleneck, exceeding even model parameter storage
KV Cache Limitations & Solutions
Core Problems:
Hardware bottleneck: KV cache size limited by available storage
Latency issues: Large cache takes time to load into memory
Context length barrier: Memory/compute requirements prevent longer contexts
Root Cause:
- Attention mechanism's computation and memory demands are the primary obstacle to extended context windows
Solution Categories:
Attention redesign - Modify the fundamental attention mechanism
KV cache optimization - Improve how key-value pairs are stored/managed
Kernel optimization - Write specialized computation kernels for attention
Bottom Line: Attention's resource intensity drives most long-context limitations, spurring diverse efficiency approaches
Redesigning Attention Mechanisms:
Key Constraint:
Must be applied during training/finetuning (changes model architecture)
Helps optimize inference but requires architectural modifications
Techniques Overview:
Local Windowed Attention (Beltagy et al., 2020)
Method: Attend only to fixed-size window of nearby tokens
Example: 10,000 token sequence → 1,000 token window = 10x KV cache reduction
Enhancement: Can interleave with global attention
Local: captures nearby context
Global: captures task-specific cross-document info
Cross-Layer Attention (Brandon et al., 2024)
Method: Share key-value vectors across adjacent layers
Result: 3 layers sharing same KV pairs = 3x cache reduction
Multi-Query Attention (Shazeer, 2019)
Method: Share key-value vectors across all query heads
Effect: Reduces KV pairs by consolidating across heads
Grouped-Query Attention (Ainslie et al., 2023)
Method: Generalization of multi-query attention
Approach: Group query heads, share KV pairs only within groups
Benefit: Flexible balance between query heads and KV pairs
Real-World Impact: Character.AI - AI Chatbot Application Case Study
Context: 180 messages average per conversation
Bottleneck: KV cache size limited inference throughput
Solution combo: Multi-query + local/global interleaving + cross-layer attention
Result: >20x KV cache reduction → memory no longer bottlenecks large batch sizes
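The sketch below shows how sharing K/V vectors across query heads (as in the MQA and GQA techniques above) shrinks the KV cache: the cache scales with the number of KV heads, not query heads. The 40 heads × 128 head dimension and the 8-group GQA setting are illustrative assumptions chosen to line up with the earlier 54 GB example.

```python
# KV cache size as a function of the number of KV heads.
# MHA keeps one K/V pair per query head; GQA keeps one per group; MQA keeps one total.

def kv_cache_gb(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # The leading 2 accounts for storing both keys and values.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value / 1e9

batch, seq_len, n_layers, head_dim = 32, 2048, 40, 128
n_query_heads = 40

for name, n_kv_heads in [("MHA", n_query_heads), ("GQA (8 groups)", 8), ("MQA", 1)]:
    size = kv_cache_gb(batch, seq_len, n_layers, n_kv_heads, head_dim)
    print(f"{name:15s}: {size:6.1f} GB")
```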
KV Cache Optimization Techniques
Core Goal: Reduce memory bottleneck during inference & enable larger batch sizes for long-context applications
Key Techniques:
PagedAttention (vLLM Framework)
Innovation: Divides KV cache into non-contiguous blocks
Benefits:
Reduces memory fragmentation
Enables flexible memory sharing
Improves LLM serving efficiency
Impact: Major factor in vLLM's rapid adoption
Other KV Cache Methods:
KV cache quantization (Hooper et al., 2024; Kang et al., 2024)
Adaptive KV cache compression (Ge et al., 2023)
Writing Kernels for Attention Computation
Approach: Optimize how attention scores are computed rather than changing mechanism/storage
FlashAttention Example:
Method: Fuses multiple transformer operations into single kernel
Hardware-specific: Originally for NVIDIA A100, later FlashAttention-3 for H100
Result: Dramatic speed improvements (see Figure 12: ~17 ms → ~1 ms)
Figure 12: Standard PyTorch attention involves multiple separate operations, each requiring a slow round-trip to GPU memory. FlashAttention fuses these into a single, highly optimized kernel, drastically reducing execution time.
Kernels & Compilers Overview
What are Kernels:
Specialized code optimized for specific hardware (GPUs, TPUs)
Handle computationally intensive, repetitive operations
Common AI operations: matrix multiplication, attention, convolution
Whenever a new hardware architecture is introduced, new kernels need to be developed: original FlashAttention was developed for NVIDIA A100 GPUs, but later FlashAttention-3 was introduced for H100 GPUs (Shah et al., 2024)
Programming Requirements:
Languages: CUDA (NVIDIA), Triton (OpenAI), ROCm (AMD)
Knowledge needed: Hardware architecture, memory hierarchy, thread management
Entry Barrier: Higher-level than Python, historically "dark art" practiced by a few; chip makers like NVIDIA and AMD employ optimization engineers to write kernels to make their hardware efficient for AI workloads, whereas AI frameworks like PyTorch and TensorFlow employ kernel engineers to optimize their frameworks on different accelerators.
Trend: More AI engineers learning due to inference optimization demand
Four Common Optimization Techniques:
Vectorization
Given a loop/nested loop, process multiple contiguous data elements simultaneously
Reduces latency by minimizing data I/O operations
Parallelization
Divide arrays into independent chunks for simultaneous processing
Utilizes multiple cores/threads
Loop Tiling
Optimize data access order for hardware memory layout/cache
Hardware-dependent (CPU ≠ GPU patterns)
Operator Fusion
Combine multiple operators into single pass
Reduces redundant memory access
Requires deep model architecture understanding
Example: fuse two loops operating on the same array into a single one → reduces the number of data reads/writes
Compilation Process:
Lowering: A model script specifies a series of operations that need to be performed to execute that model. To run this code on a piece of hardware (e.g., a GPU), it has to be converted into a language compatible with that hardware; this conversion is called lowering. During the lowering process, whenever possible, operations are converted into specialized kernels that run faster on the target hardware.
Compilers: Tools that lower code to run on specific hardware, bridging ML models with hardware execution
Examples: torch.compile, Apache TVM, MLIR (Multi-Level Intermediate Representation), XLA (Accelerated Linear Algebra) & OpenXLA, TensorRT
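As a small taste of compilation in practice, here is a sketch of wrapping a model with torch.compile (assuming PyTorch 2.x) so the compiler can fuse operators and lower them to optimized kernels. The tiny MLP here is a stand-in, not the Llama model used in the case study below.

```python
# Sketch: compiling a small PyTorch model so the compiler can fuse operators
# and lower them to optimized kernels on the target hardware.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).eval()

compiled_model = torch.compile(model)   # lowering + kernel fusion happen lazily

x = torch.randn(8, 1024)
with torch.no_grad():
    y = compiled_model(x)               # first call triggers compilation; later calls are fast
print(y.shape)
```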
PyTorch Llama-7B Optimization Case Study
Sequential Improvements (A100 GPU, 80GB memory):
Baseline (eager): 25.5 tok/s/user
torch.compile: 107.0 tok/s/user (+320%)
INT8 quantization: 157.4 tok/s/user (+47%)
INT4 quantization: 202.1 tok/s/user (+28%)
INT4 + speculative decoding: 244.7 tok/s/user (+21%)
Total improvement: ~860% throughput increase
Note: Quality impact of optimizations unclear from study
Figure 13: This chart shows the compounding effect of different optimization techniques on a Llama-7B model. Each step—compiling, quantizing, and adding speculative decoding—provides a significant boost in throughput (tok/s/user).
2. Service-Level Optimization
Core Goal: Efficiently allocate fixed resources (compute/memory) to dynamic workloads while optimizing latency and cost
Key principle: Service-level techniques don't modify models or change output quality
Batching Strategies
Concept: Process multiple requests together (like putting people on a bus vs. individual cars)
Benefit: Significantly reduces cost and increases throughput
Trade-off: May increase individual request latency
Static Batching
Method: Wait for fixed number of inputs before processing
Analogy: Bus waits until every seat filled
Problem: First requests delayed until batch is full
Dynamic Batching
Method: Set maximum batch size AND time window
Example: Process when 4 requests OR 100ms elapsed (whichever first)
Analogy: Bus leaves on schedule OR when full
Benefits: Controls latency for early requests
Drawback: May waste compute with unfilled batches
Figure 14: Dynamic batching keeps the latency manageable but might be less compute-efficient.
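A minimal sketch of a dynamic batching loop: flush a batch either when it reaches the maximum size or when the time window since the first queued request elapses, whichever comes first. The run_model_on_batch coroutine, the 4-request/100 ms settings, and the request generator are all placeholders for illustration.

```python
import asyncio

MAX_BATCH_SIZE = 4
MAX_WAIT_MS = 100

async def run_model_on_batch(batch):
    # Placeholder for the actual batched inference call.
    await asyncio.sleep(0.05)
    return [f"response to {req}" for req in batch]

async def batching_loop(queue: asyncio.Queue):
    while True:
        batch = [await queue.get()]                    # block until the first request arrives
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break                                  # time window elapsed: flush what we have
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        print(await run_model_on_batch(batch))

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batching_loop(queue))
    for i in range(6):                                 # requests trickle in every 30 ms
        await queue.put(f"request-{i}")
        await asyncio.sleep(0.03)
    await asyncio.sleep(0.5)                           # let the last batch flush
    worker.cancel()

asyncio.run(main())
```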
Continuous Batching (In-flight Batching)
Naive Batching Problem
Core Issue: All batch requests must complete before ANY responses are returned
LLM-Specific Challenge: Highly variable response lengths
Example scenario:
Request A: generates 10 tokens
Request B: generates 1,000 tokens
Problem: Request A waits for Request B to finish (100x longer)
Result: Unnecessary latency penalty for shorter requests
Why This Matters: Response time becomes bottlenecked by the longest request in each batch, defeating the purpose of batching for faster requests
This problem is what continuous batching solves by allowing completed responses to return immediately
Innovation: Return responses as soon as completed (don't wait for entire batch)
Method: Replace completed requests with new ones immediately
Analogy: Bus picks up new passenger after dropping one off
Problem solved: Short responses don't wait for long responses (10 tokens vs 1,000 tokens)
Source: Introduced in Orca paper (Yu et al., 2022)
Figure 15: In Normal Batching, the entire batch must wait for the longest request (R7) to finish. In Continuous Batching, requests (R1, R2, R3, R4) are processed independently, and new requests (R5, R6) can start as soon as slots open up, dramatically improving efficiency.
Decoupling Prefill and Decode
Problem:
Prefill: Compute-bound
Decode: Memory bandwidth-bound
Running both on same machine causes resource competition → slows TTFT and TPOT
Solution: Assign prefill and decode to different instances (GPUs)
Research: DistServe (Zhong et al., 2024), "Inference Without Interference" (Hu et al., 2024)
Communication Overhead: Minimal with high-bandwidth connections (NVLink)
Instance Ratios:
Long inputs + prioritize TTFT: 2:1 to 4:1 (prefill:decode)
Short inputs + prioritize TPOT: 1:2 to 1:1 (prefill:decode)
Prompt Caching (Gim et al., 2023)
Concept: Store overlapping text segments for reuse across prompts
Alternative names: Context cache, prefix cache
Use Cases:
System prompts: Process once, reuse for all queries
Long documents: Cache document for multiple related queries
Long conversations: Cache earlier messages for future predictions
Figure 16: A prompt cache visualized.
Impact Example:
- 1,000-token system prompt + 1M daily API calls = ~1B repetitive tokens saved daily
Real-World Pricing (as of writing):
Google Gemini: 75% discount on cached tokens + $1.00/million tokens per hour cache storage
Anthropic: Up to 90% cost savings + up to 75% latency reduction
Anthropic Performance Data:
| Use Case | Latency w/o caching (TTFT) | Latency with caching (TTFT) | Cost reduction |
| --- | --- | --- | --- |
| Chat with book (100K cached tokens) | 11.5 s | 2.4 s (-79% latency) | -90% |
| Many-shot prompting (10K tokens) | 1.6 s | 1.1 s (-31% latency) | -86% |
| Multi-turn conversation (10-turn convo with a long system prompt) | ~10 s | ~2.5 s (-75% latency) | -53% |
Trade-offs
Large cache size consumes memory
Significant engineering effort to implement yourself unless you use an API with built-in caching
How It Works
Prompt Segmentation
The system splits a prompt into segments (often at token level).
Example: Static system instructions → cached; changing user input → un-cached.
Hashing & Storage
Each segment is hashed (using a cryptographic hash like SHA-256) to create a unique identifier for its token sequence.
These hashes are stored in a cache (memory, Redis, or a distributed key–value store) along with the model’s internal representation (the hidden states after token embedding and transformer processing).
Cache Lookup
On a new request, the system checks if the hashed segments already exist in the cache.
If found, it retrieves the precomputed hidden states instead of recomputing them.
Partial Processing
Cached segments are fed into the model as precomputed key-value pairs for the attention layers.
The model only needs to process the new tokens, which dramatically reduces compute.
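A toy illustration of the hash-and-lookup flow described above (a string stands in for the cached KV state, and the token IDs are made up):

```python
import hashlib

# Toy prefix cache keyed by a hash of the token sequence (illustrative only).
# In a real system the cached value would be the attention KV tensors for the
# prefix; here a string stands in for that precomputed state.
class PromptCache:
    def __init__(self):
        self._store = {}  # hash -> precomputed prefix state

    @staticmethod
    def _key(token_ids):
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def lookup(self, token_ids):
        return self._store.get(self._key(token_ids))

    def insert(self, token_ids, prefix_state):
        self._store[self._key(token_ids)] = prefix_state

cache = PromptCache()
system_prompt_ids = [101, 2023, 2003, 1037]   # hypothetical token IDs
cache.insert(system_prompt_ids, "kv-state-for-system-prompt")
# A later request with the same prefix reuses the entry instead of re-running prefill.
assert cache.lookup(system_prompt_ids) is not None
```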
Implementation Strategies
A. Client-Side Prompt Caching
The application detects repeated parts of prompts before sending them to the LLM API.
Sends a “cache hit” reference instead of re-sending the entire segment.
Used in some streaming APIs and enterprise systems.
B. Server-Side Prompt Caching
The LLM provider (e.g., OpenAI, Anthropic) handles caching automatically.
You send the whole prompt, but the backend identifies repeated segments and skips re-computation.
Typically implemented with:
Token hashing for identification.
KV (key–value) cache in GPU memory for quick retrieval.
Persistent store for cross-session caching.
C. Fine-Tuning With Cached Context
- For ultra-repeated prompts, fine-tuning or LoRA adapters may store static context within the model itself, making caching unnecessary.
Key Challenges
Tokenization Consistency – Even minor text changes can alter tokenization and cause cache misses.
Cache Storage Size – Storing full hidden states for large models can require significant GPU RAM or disk space.
Cache Eviction – Implementing LRU (least recently used) or TTL (time-to-live) policies to manage memory.
Security – Hash collisions are rare but possible; sensitive prompts require careful handling.
Example Workflow
First request:
[System prompt: "You are a legal assistant..."] + [Case background: 3000 tokens] + [User question]
System prompt & background are hashed → stored in cache.
Model processes everything → stores hidden states.
Next request (same background, different question):
Cache hit for system prompt & background.
Only the new question’s tokens are processed fresh.
Latency drops significantly.
Parallelism Strategies
Two Universal Families
Data parallelism
Model parallelism
LLM-Specific Families
Context parallelism
Sequence parallelism
Implementation Strategies
1. Replica Parallelism
Method: Create multiple model copies to handle more requests at the same time.
Constraint: Model may be too large to fit on one machine.
Challenge: Bin-packing problem with different model sizes (8B, 13B, 34B, 70B) and GPU memory (24GB, 40GB, 48GB, 80GB).
Optimization Scenarios:
1. Fixed Hardware → Optimize Model Deployment
Given: Fixed number of GPUs
Decision: How many replicas per model + GPU allocation strategy
Goal: Maximize performance metrics
Example dilemma: how to fill 40 GB of GPU memory
Option A: 3x 13B models
Option B: 1x 34B model
Challenge: Balance throughput vs. model capability
2. Fixed Model Requirements → Optimize Hardware Purchase
Given: Fixed number of model replicas needed
Decision: Which GPUs to acquire
Goal: Minimize hardware cost
Reality check: Rarely occurs in practice
Key insight: Bin-packing becomes increasingly complex with more models, replicas, and GPU types
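A naive first-fit sketch of the placement problem, with illustrative model footprints and GPU memory figures (real schedulers also weigh throughput targets, latency SLOs, and KV-cache headroom):

```python
# First-fit bin-packing sketch: assign model replicas (rough memory footprints)
# to GPUs with fixed memory. All numbers are illustrative.
replicas = [("llama-13b", 13), ("llama-13b", 13), ("llama-34b", 34)]  # ~GB at 8-bit
gpus = [{"name": "A100-40GB", "free": 40}, {"name": "L40S-48GB", "free": 48}]

placement = {}
for model, size_gb in replicas:
    for gpu in gpus:
        if gpu["free"] >= size_gb:          # first GPU with enough free memory
            gpu["free"] -= size_gb
            placement.setdefault(gpu["name"], []).append(model)
            break
    else:
        raise RuntimeError(f"no GPU can fit {model}")

print(placement)  # {'A100-40GB': ['llama-13b', 'llama-13b'], 'L40S-48GB': ['llama-34b']}
```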
2. Model Parallelism
Use Case: Model too large for single machine.
Types:
Tensor Parallelism (Intra-operator)
Method: Split tensors across devices for parallel execution.
Example: Split matrix column-wise for matrix multiplication.
Benefits: Enables large model serving + reduces latency.
Drawbacks: Communication overhead.
Figure 17: Tensor parallelism illustrated for mat-mul.
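The column-wise split can be demonstrated with plain NumPy. Each slice plays the role of a device, and the final concatenation stands in for the all-gather a real implementation would perform:

```python
import numpy as np

# Column-wise split of a matrix multiplication, the core idea behind tensor
# parallelism: each "device" computes part of the output columns.
x = np.random.randn(4, 8)           # activations: (batch, hidden)
w = np.random.randn(8, 16)          # weight matrix: (hidden, out)

w_dev0, w_dev1 = w[:, :8], w[:, 8:]  # each device holds half the columns
y_dev0 = x @ w_dev0                  # computed on device 0
y_dev1 = x @ w_dev1                  # computed on device 1
y = np.concatenate([y_dev0, y_dev1], axis=1)

assert np.allclose(y, x @ w)         # same result as the unsplit matmul
```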
Pipeline Parallelism
Method: Divide model into stages, assign each stage to a different device.
Process: Split batch into micro-batches, pass output between stages.
Benefits: Enables large model serving.
Drawbacks: Increases total latency due to inter-stage communication.
Usage: Avoided for strict latency requirements; more common in training.
Figure 18: Pipeline parallelism illustrated on 4 machines.
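A toy schedule showing the stage/micro-batch structure; the stages here are ordinary functions, and the loop runs sequentially rather than overlapping micro-batches across stages as a real pipeline would:

```python
# Toy pipeline-parallel schedule: the model is split into stages, a batch is
# split into micro-batches, and each stage passes its output downstream.
def run_pipeline(stages, batch, num_microbatches=4):
    microbatches = [batch[i::num_microbatches] for i in range(num_microbatches)]
    outputs = []
    for mb in microbatches:          # a real system overlaps these across stages
        activation = mb
        for stage in stages:         # each stage lives on a different device
            activation = stage(activation)
        outputs.extend(activation)
    return outputs

# Example with two "stages" acting on a list of numbers.
stages = [lambda xs: [x + 1 for x in xs], lambda xs: [x * 2 for x in xs]]
print(run_pipeline(stages, list(range(8))))
```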
FSDP (Fully Sharded Data Parallel)
Core Innovation: Memory-efficient training strategy that distributes model components across GPUs.
How It Works:
Shards distributed: model weights, gradients, AND optimizer states.
No GPU stores a complete model copy.
Memory principle: No single GPU holds entire model simultaneously.
Benefits:
Larger models possible on fewer GPUs.
Memory efficiency by minimizing per-GPU overhead.
Communication optimization by reducing redundancy.
Traditional Data Parallelism vs. FSDP:
Traditional: Each GPU = full model copy (high memory usage).
FSDP: Each GPU = model fragments only (distributed memory load).
Key Insight: Enables scaling up model size without proportionally scaling hardware.
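A minimal PyTorch FSDP wrapping sketch, assuming the process group is launched via torchrun with one GPU per process (exact APIs and recommended wrapping policies vary across PyTorch versions):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                       # env set up by torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# After wrapping, each rank stores only a shard of the parameters; full
# weights are gathered on the fly for each forward (and backward) pass.
sharded_model = FSDP(model)
out = sharded_model(torch.randn(8, 4096, device="cuda"))
```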
3. Specialized LLM Parallelism
Context Parallelism
Method: Split input sequence across devices.
Example: First half on machine 1, second half on machine 2.
Sequence Parallelism
Method: Split operators across machines.
Example: Attention on machine 1, feedforward on machine 2.
Insights & Relationships
The Bottleneck Defines the Solution: A memory-bound problem (decoding) won't be solved by a more powerful processor (more FLOP/s). It needs higher memory bandwidth or techniques that reduce memory traffic (like FlashAttention and KV cache quantization).
Latency vs. Throughput Trade-off: Almost every optimization forces a choice. Batching increases throughput (lower cost per request) but can increase latency. You must optimize for the metric that matters most to your users.
Hardware and Software Co-design: The most advanced optimizations (like FlashAttention) are born from a deep understanding of the hardware architecture (memory hierarchy, compute units). This is why companies like NVIDIA, Google, and OpenAI develop software (CUDA, Triton, XLA) alongside their hardware.
The Rise of Inference-Specific Solutions: As training becomes more centralized, the real-world cost of AI is shifting to inference. This drives innovation in inference-specific hardware (e.g., AWS Inferentia), architectures (MQA/GQA), and serving systems (vLLM, TGI).
Quantization is the Low-Hanging Fruit: It is the easiest, most reliable, and often most impactful optimization. An 8-bit quantized model uses half the memory and bandwidth of its 16-bit version, often with negligible quality loss.
Practice Questions / Flashcards
Q: Why is the LLM prefill phase typically compute-bound?
- A: Because it processes all input tokens simultaneously in large, parallel matrix multiplications, which maxes out the GPU's computational capacity (FLOP/s).
Q: Why is the LLM decode phase typically memory bandwidth-bound?
- A: Because for each token, it must load the entire set of model weights from slow HBM memory. The computation per token is small, so the bottleneck is data movement speed, not calculation speed.
Q: What is the key difference between `nvidia-smi`'s "GPU Utilization" and MFU (Model FLOP/s Utilization)?
- A: `nvidia-smi` utilization only shows whether the GPU is active (busy), not whether it is being used efficiently. A GPU can be 100% busy while performing only 1% of its peak FLOP/s. MFU measures actual efficiency by comparing achieved FLOP/s to the theoretical maximum.
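For intuition, here is a rough, assumption-heavy MFU calculation for a 7B model served on an A100 (the throughput number is made up; achieved FLOP/s is approximated as 2 × parameters × tokens/s):

```python
# Back-of-the-envelope MFU check (illustrative numbers, not measurements).
peak_flops = 312e12          # e.g., A100 dense BF16 peak, ~312 TFLOP/s
params = 7e9                 # 7B-parameter model
tokens_per_second = 1000     # assumed serving throughput

achieved_flops = 2 * params * tokens_per_second   # ~14 TFLOP/s
mfu = achieved_flops / peak_flops
print(f"MFU ≈ {mfu:.1%}")    # ≈ 4.5%, even if nvidia-smi shows "100% utilization"
```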
Q: How does Speculative Decoding speed up inference without changing the final output of the target model?
- A: It uses a smaller, faster "draft" model to generate candidate tokens, which are then verified in a single, parallel pass by the larger, accurate "target" model. Since verification is parallelizable and faster than sequential generation, this accelerates the process. The final output is always what the target model would have produced.
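A greedy-only sketch of that draft-and-verify loop, with hypothetical model interfaces (production implementations also handle sampling via rejection sampling and can append a bonus token from the verification pass):

```python
# Greedy speculative decoding sketch: the draft proposes k tokens, the target
# scores all of them in one parallel pass, and we keep the longest prefix the
# target agrees with, so the output matches pure greedy target decoding.
def speculative_step(target, draft, context, k=4):
    proposal = list(context)
    for _ in range(k):                        # cheap sequential drafting
        proposal.append(draft.next_token(proposal))
    drafted = proposal[len(context):]
    # One target forward pass gives the target's greedy choice at each position.
    target_choices = target.next_tokens_parallel(context, drafted)
    accepted = []
    for drafted_tok, target_tok in zip(drafted, target_choices):
        if drafted_tok == target_tok:
            accepted.append(drafted_tok)      # target agrees: keep the drafted token
        else:
            accepted.append(target_tok)       # disagreement: take the target's token, stop
            break
    return context + accepted
```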
Q: What is the primary purpose of the KV Cache, and what is its main drawback?
- A: Its purpose is to store the key/value vectors from the attention mechanism for all previous tokens, avoiding expensive re-computation at each new step. Its main drawback is its massive memory consumption, which grows linearly with sequence length and batch size and can become the main bottleneck for long-context applications.
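A back-of-the-envelope estimate of that growth, assuming Llama-2-7B-like shapes in FP16 (all numbers are illustrative):

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes,
# per token per sequence, assuming Llama-2-7B-like shapes in FP16.
layers, kv_heads, head_dim, bytes_per_value = 32, 32, 128, 2
per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value   # 524,288 B

seq_len, batch_size = 4096, 8
total_gb = per_token_bytes * seq_len * batch_size / 1e9
print(f"~{total_gb:.1f} GB of KV cache")   # ≈ 17.2 GB, linear in seq_len and batch
```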
Q: Explain the difference between Tensor Parallelism and Pipeline Parallelism.
- A: Tensor Parallelism splits a single operation (like a matrix multiplication) across multiple devices, reducing latency. Pipeline Parallelism splits the model's layers across devices, with each device handling a different stage. This increases throughput but adds latency due to inter-device communication.
Q: Your chatbot feels slow to start answering but then generates text quickly. Which metric would you focus on improving: TTFT or TPOT?
- A: You would focus on improving TTFT (Time to First Token), as this metric governs the initial response latency that the user perceives as "slowness to start."
Written by

Gasym A. Valiyev
Intelligent Systems Engineer specializing in Robotics & AI. Expertise in ML/DL, Python/C++, LLMs, RAG, and end-to-end intelligent systems development. Passionate about solving complex problems with AI.