The AI Engineer's Guide to Inference Optimization: Making Models Faster & Cheaper


Welcome to a deep dive into one of the most critical and fascinating areas of AI Engineering: Inference Optimization. While building powerful models is one part of the equation, making them run efficiently—faster, cheaper, and at scale—is what makes them viable in the real world. Whether you're building your own inference service or evaluating a third-party API, understanding these principles is non-negotiable.
This guide breaks down the core concepts, metrics, and techniques that turn a slow, expensive model into a production-ready powerhouse.
Note: This article is my personal rewritten version of concepts I’ve studied from Chip Huyen’s book, “AI Engineering.” It is not an official summary or reproduction — rather, a learning exercise to reinforce and share what I’ve learned.
Key Concepts
Before we dive in, let's anchor ourselves with the central ideas of this article:
Inference vs. Training: The two phases of a model's lifecycle. We focus on inference—using a trained model for predictions.
Inference Overview: In production, the component that runs model inference is called an inference server. It hosts the available models and has access to the necessary hardware. Based on requests from applications (e.g., user prompts), it allocates resources to execute the appropriate models and returns the responses to users. An inference server is part of a broader inference service, which is also responsible for receiving, routing, and possibly preprocessing requests before they reach the inference server. Model APIs like those provided by OpenAI and Google are inference services. If you use one of these services, you won’t be implementing most of the techniques discussed in this article. However, if you host a model yourself, you’ll be responsible for building, optimizing, and maintaining its inference service. A visualization of a simple inference service is shown in Figure 1.
Computational Bottlenecks: The two primary constraints on performance:
Compute-Bound: Performance is limited by the number of calculations (FLOP/s) a processor can perform; completion time is determined by how much computation the task requires. For example, password decryption is typically compute-bound due to the intensive mathematical calculations required to break encryption algorithms.
Memory Bandwidth-Bound: Limited by the speed at which data can be moved between memory and the processor - the data transfer rate within the system. For example, if you store your data in the CPU memory and train a model on GPUs, you have to move data from the CPU to the GPU, which can take a long time. If you’re constrained by GPU memory and cannot fit an entire model into the GPU, you can split the model across GPU memory and CPU memory. This splitting will slow down your computation because of the time it takes to transfer data between the CPU and GPU. However, if data transfer is fast enough, this becomes less of an issue. Therefore, the memory capacity limitation is actually more about memory bandwidth.
LLM Inference Phases:
Prefill (Compute-Bound): Processing the input prompt in parallel. How many tokens can be processed at once is limited by the number of operations your hardware can execute in a given time. Therefore, prefilling is compute-bound.
Decode (Memory-Bound): Generating output tokens one by one (autoregressively; one output token at a time). At a high level, this step typically involves loading large matrices (e.g., model weights) into GPUs, which is limited by how quickly your hardware can load data into memory. Decoding is, therefore, memory bandwidth-bound.
Because prefill and decode have different computational profiles, they are often decoupled in production with separate machines.
Today, due to the prevalence of the transformer architecture and the limitations of existing accelerator technologies, many AI and data workloads are memory bandwidth-bound. However, future software and hardware advancements may shift these workloads toward being compute-bound.
Online and batch inference APIs
Online APIs optimize for latency. Requests are processed as soon as they arrive. Customer-facing use cases, such as chatbots and code generation, typically require lower latency, and, therefore, tend to use online APIs.
Batch APIs optimize for cost. If your application doesn’t have strict latency requirements, you can send requests to batch APIs for more efficient processing. Higher latency allows a broader range of optimization techniques, including batching requests together and using cheaper hardware. For example, as of this writing, both Google Gemini and OpenAI offer batch APIs at a 50% cost reduction and significantly higher turnaround time, on the order of hours instead of seconds or minutes. Use cases with less stringent latency requirements, which are ideal for batch APIs, include the following:
Synthetic data generation;
Periodic reporting, such as summarizing Slack messages, sentiment analysis of brand mentions on social media, and analyzing customer support tickets;
Onboarding new customers who require processing of all their uploaded documents;
Migrating to a new model that requires reprocessing of all the data;
Generating personalized recommendations or newsletters for a large customer base;
Knowledge base updates by reindexing an organization’s data.
APIs usually return complete responses by default. However, with autoregressive decoding, it can take a long time for a model to complete a response, and users are impatient. Many online APIs offer streaming mode, which returns each token as it’s generated. This reduces the time the users have to wait until the first token. The downside of this approach is that you can’t score a response before showing it to users, increasing the risk of users seeing bad responses. However, you can still retrospectively update or remove a response as soon as the risk is detected.
WARNING: A batch API for foundation models differs from batch inference for traditional ML. In traditional ML:
Online inference means that predictions are computed after requests have arrived.
Batch inference means that predictions are precomputed before requests have arrived.
Pre-computation is possible for use cases with finite and predictable inputs like recommendation systems, where recommendations can be generated for all users in advance. These precomputed predictions are fetched when requests arrive, e.g., when a user visits the website. However, with foundation model use cases where the inputs are open-ended, it’s hard to predict all user prompts.
Core Performance Metrics:
Latency: Time to First Token (TTFT) and Time Per Output Token (TPOT).
Throughput: Tokens per second (TPS), a proxy for cost efficiency.
Utilization: Model FLOP/s Utilization (MFU) and Model Bandwidth Utilization (MBU), which measure true hardware efficiency.
Key Optimization Strategies:
Model-Level: Quantization, Pruning, Speculative Decoding, Attention Optimization (KV Cache).
Service-Level: Batching (Static, Dynamic, Continuous), Parallelism (Tensor, Pipeline), Prompt Caching.
Definitions & Equations
1. Computational Bottlenecks
Inference Server: The software component that runs model inference, hosting models, and managing hardware resources to serve requests.
Compute-Bound: A task whose completion time is limited by the processor's calculation speed (FLOP/s).
Memory Bandwidth-Bound (or Memory-Bound): A task whose completion time is limited by the rate of data transfer between memory (e.g., HBM) and the processor cores.
Arithmetic Intensity: The ratio of arithmetic operations to bytes of memory accessed for a given task. It's the key factor in determining whether a task is compute- or memory-bound.
Formula:
Arithmetic Intensity = FLOPs / Byte
Low Intensity: Memory-bound (many memory accesses per calculation).
High Intensity: Compute-bound (many calculations per memory access).
Figure 2: The roofline chart visualizes whether an operation is compute-bound or memory-bound. The "roofline" is formed by the hardware's peak FLOP/s (the flat part) and its peak memory bandwidth (the slanted part). Workloads whose arithmetic intensity falls to the left of the ridge point sit under the slanted part (memory bandwidth-bound); those to the right sit under the flat part (compute-bound).
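To make the roofline idea concrete, here is a minimal sketch in plain Python; the peak FLOP/s and bandwidth figures are illustrative, roughly A100-class, and not taken from the text:

```python
# Minimal roofline sketch. Hardware peaks are illustrative, roughly A100-class.
PEAK_FLOPS = 312e12        # peak FP16 FLOP/s
PEAK_BANDWIDTH = 2.0e12    # peak memory bandwidth in bytes/s

def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline: attainable FLOP/s is capped either by compute or by bandwidth x intensity."""
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * arithmetic_intensity)

def classify(flops: float, bytes_moved: float) -> str:
    intensity = flops / bytes_moved                 # FLOPs per byte
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH             # intensity where the two roofs meet
    kind = "compute-bound" if intensity >= ridge else "memory bandwidth-bound"
    return f"intensity={intensity:.1f} FLOPs/byte (ridge={ridge:.0f}) -> {kind}"

# Example: decoding one token of a 7B-parameter FP16 model moves ~14 GB of weights
# for roughly 2 FLOPs per parameter of work -> very low arithmetic intensity.
print(classify(flops=2 * 7e9, bytes_moved=2 * 7e9))
```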
2. LLM Inference Metrics
Time to First Token (TTFT): The time from when a user sends a query until the first output token is generated. This corresponds to the prefill step and depends on the input’s length.
User Expectations: Users might have different expectations for TTFT for different applications.
Chatbots/Real-time Interaction: Users expect near-instantaneous TTFT, making the interaction feel fluid and responsive.
Batch Tasks (e.g., Document Summarization): Users may tolerate a longer TTFT, as the primary goal is the complete output, not immediate feedback.
Impact of Intermediate Steps:
Scenario: Consider queries involving Chain-of-Thought (CoT) or Agentic Queries. The model might perform internal "planning" or "action" steps before generating the first token the user sees.
Model Perspective: The model might consider the first token generated after internal planning as its "first token."
User Perspective: The user only sees the first token of the final response, which occurs after all internal steps are completed. This leads to a longer observed TTFT for the user compared to the model's internal measurement. To clarify this, some teams use the term "time to publish" to specifically denote the TTFT as perceived by the end-user.
Time Per Output Token (TPOT): The average time taken to generate each subsequent token after the first one. This corresponds to the decode step. If each token takes 100 ms, a response of 1,000 tokens will take 100 s. In streaming mode, where users read each token as it’s generated, TPOT should be faster than human reading speed but doesn’t have to be much faster. A very fast reader consumes a token roughly every 120 ms, so a TPOT of around 120 ms, or about 6–8 tokens/second, is sufficient for most use cases.
Time Between Tokens (TBT) / Inter-Token Latency (ITL): These are essentially synonyms for TPOT, emphasizing the time gap between consecutively generated tokens.
Total Latency:
TTFT + (TPOT × Number of Output Tokens)
User Experience Nuance: Two applications with the same total latency can offer different user experiences with different TTFT and TPOT. Would your users prefer instant first tokens with a longer wait between tokens, or would they rather wait slightly longer for the first tokens but enjoy faster token generation afterward? User studies will be necessary to determine the optimal user experience. Reducing TTFT at the cost of higher TPOT is possible by shifting more compute instances from decoding to prefilling and vice versa.
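As a tiny worked example of the total-latency formula (the numbers are made up), two very different TTFT/TPOT splits can produce exactly the same total latency while feeling quite different to the user:

```python
def total_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Total latency = TTFT + TPOT x number of output tokens."""
    return ttft_ms + tpot_ms * output_tokens

# Same total latency, different user experience: instant start vs. faster generation afterward.
print(total_latency_ms(ttft_ms=200, tpot_ms=100, output_tokens=500))   # 50200.0 ms
print(total_latency_ms(ttft_ms=5200, tpot_ms=90, output_tokens=500))   # 50200.0 ms
```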
The Importance of Percentiles for Latency:
Why Averages Can Be Misleading: Simply looking at the average TTFT or TPOT can be highly deceptive. A single, extremely slow request (an outlier) due to a transient network issue, a very long input prompt, or a rare internal model state can inflate the average, making the service appear much slower than it is for most users.
- Example: A set of TTFTs: [100, 102, 100, 100, 99, 104, 110, 90, 3000, 95] ms. The average is 390 ms, suggesting a very slow service. However, 9 out of 10 requests were under 110 ms.
Percentiles as a Better Indicator: Latency is best understood as a distribution. Percentiles help analyze this distribution:
p50 (Median): The 50th percentile. Half of the requests are faster than this value, and half are slower. It provides a good sense of typical performance.
p90, p95, p99: These percentiles indicate the latency for the slowest 10%, 5%, and 1% of requests, respectively. They are crucial for identifying and addressing outliers and ensuring a good experience for the vast majority of users.
Visualizing with Input Length: Plotting TTFT values against input lengths can reveal if longer inputs are disproportionately increasing latency, guiding optimization efforts towards the prefill stage.
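Using the TTFT samples from the example above, a few lines of NumPy show how percentiles tell a very different story than the mean:

```python
import numpy as np

# TTFT samples (ms) from the example above; one 3,000 ms outlier skews the mean.
ttfts = np.array([100, 102, 100, 100, 99, 104, 110, 90, 3000, 95])

print("mean:", ttfts.mean())                      # 390.0 ms -- misleading
for p in (50, 90, 95, 99):
    print(f"p{p}:", np.percentile(ttfts, p))      # p50 is ~100 ms, the typical experience
```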
Throughput: This metric quantifies the overall rate at which output tokens are produced by the system (inference service) across all users and requests - the number of output tokens per second.
Definition: Measured in tokens per second (TPS), it reflects the system's capacity to generate output.
Input vs. Output Throughput: In modern inference servers, especially those decoupling prefill and decode, it's important to distinguish between input throughput (prefill tokens per second) and output throughput (decode tokens per second). When "throughput" is mentioned without a modifier, it typically refers to output tokens per second (TPS).
Scaling: To understand how the system scales with user load, throughput can also be expressed as tokens/second/user.
Alternative Measures: For foundation models with potentially long response times, Requests Per Minute (RPM) or Completed Requests Per Minute (CRPM) are sometimes used instead of Requests Per Second (RPS) to better capture system handling of concurrent tasks. Tracking this metric is useful for understanding how an inference service handles concurrent requests. Some providers might throttle your service if you send too many concurrent requests at the same time.
Cost Linkage: Throughput is directly tied to cost. Higher throughput generally leads to a lower cost per token or per request.
Example: If your system costs $2/h in compute and its throughput is 100 tokens/s, it costs around $5.556 per 1M output tokens. If each request generates 200 output tokens on average, the cost for decoding 1K requests would be $1.11. The prefill cost can be similarly calculated. If your hardware costs $2 per hour and it can prefill 100 requests per minute, the cost for prefilling 1K requests would be $0.33. The total cost per request is the sum of the prefilling and decoding costs. In this example, the total cost for 1K requests would be $1.11 + $0.33 = $1.44.
Cost Calculation Breakdown (Throughput, Prefill, and Decoding): The core idea is to translate raw performance metrics (throughput, requests per minute) into cost per unit of work (per token, per request).
Decoding Cost Calculation
Given Information:
System Compute Cost: $2 per hour
Decoding Throughput: 100 output tokens per second (TPS)
Average Tokens per Request: 200 tokens
Target: Cost per 1 Million (1M) output tokens
Step 1: Calculate Cost per Second of Compute
Your system costs $2 per hour.
There are 3600 seconds in an hour.
Cost per second = $2 / 3600 seconds ≈ $0.0005556 per second.
Step 2: Determine How Many Tokens Are Processed per Second of Compute
- Your system's decoding throughput is 100 tokens/second. This means for every second your system is running and decoding, it produces 100 tokens.
Step 3: Calculate the Cost to Produce 1 Token
Cost per token = (Cost per second of compute) / (Tokens produced per second)
Cost per token = $0.0005556 / 100 tokens ≈ $0.000005556 per token.
Step 4: Calculate the Cost per 1 Million Output Tokens
Cost per 1M tokens = (Cost per token) × 1,000,000 tokens
Cost per 1M tokens = $0.000005556 × 1,000,000 ≈ $5.556 per 1M output tokens.
Step 5: Calculate the Cost for 1K Requests (Decoding Only)
First, find the total number of output tokens for 1,000 requests.
Total output tokens = (Number of requests) × (Average tokens per request)
Total output tokens = 1,000 requests × 200 tokens/request = 200,000 tokens.
Now, calculate the cost for these tokens using the cost per token:
Cost for 1K requests (decoding) = (Cost per token) × (Total output tokens)
Cost for 1K requests (decoding) = $0.000005556 × 200,000 ≈ $1.11.
Prefill Cost Calculation
Given Information:
System Compute Cost: $2 per hour
Prefill Rate: 100 requests per minute
Step 1: Convert System Cost to Per-Minute
System cost per hour: $2
There are 60 minutes in an hour.
Cost per minute = $2 / 60 minutes ≈ $0.03333 per minute.
Step 2: Calculate the Cost to Prefill 1 Request
The system can prefill 100 requests per minute.
Cost per prefill = (Cost per minute of compute) / (Requests prefilled per minute)
Cost per prefill = $0.03333 / 100 requests ≈ $0.0003333 per request.
Step 3: Calculate the Cost for 1K Requests (Prefill Only)
Cost for 1K requests (prefill) = (Cost per prefill) × 1,000 requests
Cost for 1K requests (prefill) = $0.0003333 × 1,000 ≈ $0.33.
Total Cost per Request
Given Information:
Decoding cost per 1K requests: $1.11
Prefill cost per 1K requests: $0.33
Calculation:
Total cost per request = (Total cost for 1K requests) / 1000 requests
Total cost per request = ($1.11 + $0.33) / 1000
Total cost per request = $1.44 / 1,000 = $0.00144 per request.
(In other words, $1.11 covers decoding and $0.33 covers prefilling the same 1,000 requests, so the combined cost of 1,000 full requests, each involving one prefill and one decode, is $1.44, or $0.00144 per request.)
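The same arithmetic as a short script (all figures are the example's assumptions, not real prices):

```python
HOURLY_COST = 2.0           # $ per hour of compute
DECODE_TPS = 100            # output tokens generated per second
PREFILL_RPM = 100           # requests prefilled per minute
TOKENS_PER_REQUEST = 200    # average output tokens per request

cost_per_second = HOURLY_COST / 3600
cost_per_token = cost_per_second / DECODE_TPS
cost_per_1m_tokens = cost_per_token * 1_000_000

decode_cost_1k = cost_per_token * TOKENS_PER_REQUEST * 1_000
prefill_cost_1k = (HOURLY_COST / 60 / PREFILL_RPM) * 1_000

print(f"${cost_per_1m_tokens:.3f} per 1M output tokens")                 # ~$5.556
print(f"${decode_cost_1k:.2f} to decode 1K requests")                    # ~$1.11
print(f"${prefill_cost_1k:.2f} to prefill 1K requests")                  # ~$0.33
print(f"${decode_cost_1k + prefill_cost_1k:.2f} total for 1K requests")  # ~$1.44
```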
Factors Influencing Throughput: Throughput is influenced by the model size, hardware capabilities (e.g., high-end chips generally yield higher throughput), and workload characteristics (e.g., consistent input/output lengths are easier to optimize).
Comparison Challenges: Direct throughput comparisons across different models, hardware, or even tokenizers can be approximate, as what constitutes a "token" can vary. Comparing cost per request is often a more robust measure of efficiency.
Latency-Throughput Trade-off: Like most software systems, inference services face a fundamental trade-off. Techniques like batching can boost throughput significantly (e.g., doubling or tripling it) but often at the cost of increased TTFT and TPOT (worse latency).
Goodput: A more user-centric metric that measures the number of requests per second (RPS) that successfully meet predefined Service-Level Objectives (SLOs).
Definition: An SLO is a specific performance target, like "TTFT must be less than 200 ms" or "TPOT must be less than 100 ms."
Contrast with Throughput: While throughput measures the total output capacity, goodput measures the useful capacity—how many requests are actually satisfying the user's performance expectations.
Example: Imagine an application with SLOs of TTFT ≤ 200 ms and TPOT ≤ 100 ms. If an inference service processes 100 requests per minute (high throughput), but only 30 of those requests meet both SLOs, then the goodput of that service is 30 requests per minute. This metric provides a direct measure of how well the service is meeting user-perceived performance requirements.
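A minimal sketch of goodput accounting, assuming the SLO thresholds from the example and a hypothetical list of per-request measurements:

```python
# Goodput sketch: count only requests that meet both SLOs (thresholds from the example).
SLO_TTFT_MS = 200
SLO_TPOT_MS = 100

def goodput(requests: list[tuple[float, float]], window_seconds: float) -> float:
    """requests: (ttft_ms, tpot_ms) for each request completed in the window."""
    ok = sum(1 for ttft, tpot in requests if ttft <= SLO_TTFT_MS and tpot <= SLO_TPOT_MS)
    return ok / window_seconds  # requests per second that actually satisfy the SLOs

measurements = [(150, 80), (500, 60), (190, 120), (180, 90), (210, 95)]
print(goodput(measurements, window_seconds=1.0))  # 2.0 RPS of goodput out of 5 RPS throughput
```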
3. Hardware Utilization Metrics
Beyond latency and throughput, understanding how efficiently your hardware is being used is critical for cost-effectiveness and performance.
General Utilization: This refers to the proportion of a resource's total capacity that is actively engaged in processing tasks.
The Pitfall of Standard GPU Utilization (e.g., nvidia-smi):
What it Measures: Tools like nvidia-smi (SMI stands for System Management Interface) report "GPU Utilization" as the percentage of time the GPU is actively computing. For example, if a GPU is busy for 5 out of 10 hours, its utilization is 50%.
The Misunderstanding: This metric doesn't reflect how much work is being done relative to the hardware's potential. A tiny task can keep a powerful GPU busy, reporting 100% utilization while performing only a fraction of its peak capability.
Why It's Not Useful for Optimization: If you're paying for a machine that can do 100 operations/second but it's only doing 1 operation/second (while reporting 100% "busy time"), you are wasting money and performance.
Model FLOP/s Utilization (MFU): This is a more meaningful metric for AI workloads. It measures the actual computational efficiency relative to the hardware's theoretical peak.
Definition: MFU is the ratio of the observed computational throughput (in FLOP/s) to the theoretical maximum FLOP/s the chip can achieve.
Formula:
$$\text{MFU} = \frac{\text{Achieved FLOP/s}}{\text{Peak Theoretical FLOP/s}}$$
Example: If a chip's peak FLOP/s allows it to generate 100 tokens/s, but your inference service only achieves 20 tokens/s, your MFU is 20%. This directly tells you how much of the chip's raw computational power is being effectively utilized for your specific task.
Model Bandwidth Utilization (MBU): Similar to MFU, MBU focuses on the efficient use of memory bandwidth, which is often a critical bottleneck, especially for LLMs.
Definition: MBU measures the percentage of the hardware's peak memory bandwidth that is actually consumed by the model's operations.
Calculation for LLM Inference:
Bandwidth Used: Parameter-Count × Bytes/Parameter × Tokens/s
MBU Formula:
$$\text{MBU} = \frac{\text{Parameter-Count} \times \text{Bytes/Parameter} \times \text{Tokens/s}}{\text{Peak Theoretical Bandwidth}}$$
Example: A 7B parameter model in FP16 (2 bytes/parameter) running at 100 tokens/s uses approximately 7B × 2 × 100 = 1,400 GB/s of bandwidth. If the hardware is an A100-80GB GPU with a peak memory bandwidth of 2 TB/s (2,000 GB/s), the MBU is (1,400 GB/s) / (2,000 GB/s) = 70%.
Impact of Quantization: This example highlights why quantization is crucial. Using fewer bytes per parameter (e.g., INT4 instead of FP16) directly reduces the bandwidth requirement, improving MBU.
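To put MBU and MFU side by side for the same 7B FP16 example, here's a minimal sketch; the peak FLOP/s figure and the rough 2-FLOPs-per-parameter-per-token estimate are assumptions, not numbers from the text:

```python
# MBU/MFU sketch using the 7B FP16 example above; hardware peaks are illustrative A100-class numbers.
PARAMS = 7e9
BYTES_PER_PARAM = 2          # FP16
TOKENS_PER_SECOND = 100
PEAK_BANDWIDTH = 2000e9      # 2 TB/s in bytes/s
PEAK_FLOPS = 312e12          # peak FP16 FLOP/s (assumed)

achieved_bandwidth = PARAMS * BYTES_PER_PARAM * TOKENS_PER_SECOND   # 1,400 GB/s
mbu = achieved_bandwidth / PEAK_BANDWIDTH

achieved_flops = 2 * PARAMS * TOKENS_PER_SECOND  # rough estimate: ~2 FLOPs per parameter per token
mfu = achieved_flops / PEAK_FLOPS

print(f"MBU = {mbu:.0%}")    # ~70%: decoding keeps the memory bus busy
print(f"MFU = {mfu:.2%}")    # well under 1%: the compute units are mostly idle
```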
Relationship between Throughput, MFU, and MBU: MFU and MBU are directly proportional to throughput (tokens/s). Higher throughput achieved with the same hardware implies higher MFU and/or MBU.
Interpreting MFU and MBU:
Workload Type:
Compute-Bound Workloads: Tend to have higher MFU (using most of the FLOP/s) and lower MBU (memory bandwidth is not the bottleneck).
Memory Bandwidth-Bound Workloads: Tend to have lower MFU (processor is waiting for data) and higher MBU (using most of the available bandwidth).
Training vs. Inference: MFU for training is often higher than MFU for inference due to more predictable workloads and optimized batching strategies. For inference, prefill is typically compute-bound (higher MFU), while decode is memory-bound (higher MBU, lower MFU).
Good Utilization Benchmarks: For model training, an MFU above 50% is generally considered good. Achieving high MFU/MBU in inference can be challenging and depends heavily on the model, hardware, and specific optimization techniques used.
Figure 3: This figure shows that as the number of concurrent users increases (leading to a higher computational load per second), the MBU for Llama 2-70B decreases. This suggests that at higher concurrency, the workload might be shifting from being primarily bandwidth-bound to becoming more compute-bound, or that the available bandwidth is being saturated and cannot be further exploited.
The Goal of Optimization: While high utilization metrics are good indicators of efficiency, the ultimate goal is to achieve faster inference (lower latency) and lower costs. Simply maximizing utilization without improving these outcomes might be counterproductive. For instance, very high utilization achieved by sacrificing TTFT/TPOT could lead to a poor user experience.
4. KV Cache Size
KV Cache: A memory store used during autoregressive decoding to cache the Key (K) and Value (V) vectors of previous tokens, avoiding redundant computations.
KV Cache Size Calculation (Unoptimized):
$$2 \times B \times S \times L \times H \times M$$
Where:
B: Batch size
S: Sequence length
L: Number of transformer layers
H: Model hidden dimension
M: Bytes per parameter (e.g., 2 for FP16)
This value can become substantial as the context length increases. For example, Llama 2 13B has 40 layers and a model dimension of 5,120. With a batch size of 32, sequence length of 2,048, and 2 bytes per value, the memory needed for its KV cache, without any optimization, is 2 × 32 × 2,048 × 40 × 5,120 × 2 = 54 GB.
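The same calculation wrapped in a small helper, reproducing the Llama 2 13B example:

```python
def kv_cache_bytes(batch, seq_len, n_layers, hidden_dim, bytes_per_value):
    # 2x for keys and values; unoptimized full multi-head attention.
    return 2 * batch * seq_len * n_layers * hidden_dim * bytes_per_value

# Llama 2 13B example from the text: 40 layers, hidden dimension 5,120, FP16 values.
size = kv_cache_bytes(batch=32, seq_len=2048, n_layers=40, hidden_dim=5120, bytes_per_value=2)
print(f"{size / 1e9:.0f} GB")   # ~54 GB
```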
5. AI Accelerators
The speed and cost of running AI software are fundamentally dictated by the underlying hardware. While general optimization techniques exist, a deeper understanding of hardware enables more profound optimizations. This section focuses on AI hardware from an inference perspective, though many principles apply to training as well.
A Symbiotic History: AI and Hardware Evolution
- The development of AI models and the hardware to run them has been a tightly coupled dance. The limitations of computing power were a significant factor in the "AI winters" of the past. Conversely, breakthroughs in hardware, such as the early adoption of GPUs for deep learning with AlexNet in 2012, were instrumental in the resurgence of AI. GPUs offered a massive leap in parallel processing capability over CPUs, making large-scale neural network training accessible to researchers and sparking the deep learning revolution.
What is an AI Accelerator?
An AI Accelerator is a specialized piece of silicon designed to efficiently handle specific computational workloads associated with Artificial Intelligence.
Dominant Type: Graphics Processing Units (GPUs) are currently the most dominant type of AI accelerator, with NVIDIA being the leading economic force in this domain.
CPU vs. GPU: The Core Difference:
CPUs (Central Processing Units): Designed for general-purpose computing. They feature a few very powerful cores, optimized for high single-thread performance. CPUs excel at sequential tasks, complex logic, managing system operations (OS, I/O), and tasks that cannot be easily parallelized.
GPUs (Graphics Processing Units): Designed for highly parallel processing. They contain thousands of smaller, less powerful cores optimized for tasks that can be broken down into many identical, independent calculations. The quintessential example is matrix multiplication, which is fundamental to most machine learning operations and is inherently parallelizable.
Challenges of Parallelism: While GPUs offer immense computational power, their parallel nature introduces challenges in memory design (moving data efficiently to thousands of cores) and power consumption.
The Expanding Landscape of Accelerators
The success of GPUs has spurred the development of a diverse array of AI accelerators, including:
AMD GPUs: High-performance alternatives to NVIDIA GPUs.
Google TPUs (Tensor Processing Units): Custom-designed for neural network workloads, with a strong focus on tensor operations.
Intel Habana Gaudi: Designed for deep learning training and inference.
Graphcore IPUs (Intelligence Processing Units): Feature a unique architecture with a large amount of on-chip memory and a focus on graph-like computations.
Groq LPUs (Language Processing Units): Specialized for accelerating large language models.
Cerebras WSEs (Wafer-Scale Engines): Massive chips designed for extreme parallelism.
Specialization for Inference
A significant trend is the emergence of chips specifically optimized for inference.
Inference Cost Dominance: Studies indicate that inference can often exceed training costs in deployed AI systems, accounting for up to 90% of total ML expenditure.
Inference Optimization Focus:
Lower Precision: Inference often benefits greatly from lower numerical precision (e.g., INT8, FP8) which reduces memory footprint and speeds up computation.
Memory Access: Faster memory access is critical for quickly loading model weights.
Latency Minimization: Unlike training which prioritizes throughput, inference often aims to minimize latency.
Examples of Inference-Specific Chips: Apple Neural Engine, AWS Inferentia, MTIA (Meta Training and Inference Accelerator).
Edge Computing Accelerators: Chips designed for devices with limited power and computational resources, such as Google's Edge TPU and NVIDIA Jetson Xavier series.
Architecture-Specific Accelerators: Some chips are tailored for specific model architectures, like transformers.
Hardware Architectures and Compute Primitives
Different hardware architectures feature distinct memory layouts and specialized compute units optimized for various data types:
Compute Primitives: These are the basic operations a chip's hardware is designed to perform efficiently. Common primitives include:
Scalar Operations: Processing single data points.
Vector Operations: Processing arrays of data.
Tensor Operations: Processing multi-dimensional arrays (matrices and higher-order tensors), crucial for neural networks.
Figure 4: Illustrates different compute primitives. While traditional CPUs excel at scalar operations, GPUs have strong vector capabilities, and specialized AI accelerators (like TPUs) are built around tensor operations.
Chip Design: A chip might combine these units. GPUs traditionally supported vector operations, but many modern GPUs, for example, have evolved to include "Tensor Cores" specifically optimized for matrix and tensor computations. TPUs, conversely, are designed with tensor operations as their primary focus. To maximize efficiency, a model's operations need to align with the chip's strengths. A chip’s specifications contain many details that can be useful when evaluating this chip for each specific use case.
Key Evaluation Characteristics of Accelerators
When evaluating hardware for AI workloads, several characteristics are paramount:
Computational Capabilities:
Metric: Measured in FLOP/s (Floating-Point Operations Per Second), often expressed in teraFLOPs (TFLOPS) or petaFLOPs (PFLOPS).
Precision Dependence: Higher numerical precision (e.g., FP32 vs. FP16 vs. FP8) requires more computation per operation, leading to fewer operations per second.
Theoretical Peak vs. Actual: The advertised FLOP/s is a theoretical maximum. Actual performance (MFU) depends on how efficiently the workload can utilize the hardware.
Table Example (NVIDIA H100 SXM): Demonstrates how FLOP/s scales with precision, with FP8 offering the highest theoretical throughput when sparsity is utilized.
Memory Size and Bandwidth:
Importance: With thousands of parallel cores, efficient data movement is critical. Large AI models and datasets require fast access to memory to keep these cores busy.
GPU Memory Technologies: GPUs typically use High-Bandwidth Memory (HBM), a 3D stacked memory technology, which offers significantly higher bandwidth and lower latency compared to the DDR SDRAM used in CPUs (which has a 2D structure). This is a key reason for higher GPU memory costs.
Memory Hierarchy: An accelerator’s memory is measured by its size and bandwidth. These numbers need to be evaluated within the system an accelerator is part of. Accelerators interact with multiple memory tiers, each with different speeds and capacities (as visualized in Figure 7):
CPU DRAM (System Memory): Lowest bandwidth (25-50 GB/s), largest capacity (1TB+ possible). Used as a fallback.
GPU HBM: High bandwidth (256 GB/s to 1.5 TB/s+), moderate capacity (24-80 GB typical for data-center GPUs). This is where model weights and activations are primarily stored.
GPU On-Chip SRAM (Caches): Extremely high bandwidth (10 TB/s+), very small capacity (tens of MB). Used for immediate access to frequently used data.
Framework Limitations: A lot of GPU optimization is about making the most of this memory hierarchy. However, current popular frameworks (PyTorch, TensorFlow) offer limited direct control over it, prompting interest in lower-level programming tools such as CUDA (Compute Unified Device Architecture), Triton, and ROCm (Radeon Open Compute).
Power Consumption:
Transistor Switching: Chips rely on transistors to perform computation. Each computation is done by transistors switching on and off, which requires energy. A GPU can have billions of transistors—an NVIDIA A100 has 54 billion transistors, while an NVIDIA H100 has 80 billion. When an accelerator is used efficiently, billions of transistors rapidly switch states, consuming a substantial amount of energy and generating a nontrivial amount of heat. This heat requires cooling systems, which also consume electricity, adding to data centers’ overall energy consumption.
Environmental Impact: The massive energy consumption of data centers powering AI is a growing concern, driving demand for energy-efficient hardware and "green data center" technologies.
Metrics:
Maximum Power Draw: The absolute peak power a chip can consume.
TDP (Thermal Design Power): A proxy for power consumption, representing the heat a cooling system must dissipate under typical workloads. For CPUs and GPUs, actual power draw can be 1.1 to 1.5 times the TDP.
Cloud vs. On-Prem: Cloud users are insulated from direct cooling/electricity costs but should still consider the environmental impact.
Selecting the Right Accelerators
The choice of accelerator hinges on the specific workload:
Compute-Bound Workloads: Prioritize chips with higher FLOP/s.
Memory-Bound Workloads: Focus on chips with higher memory bandwidth and larger memory capacity.
When making a selection, consider these core questions:
Can it run the workload? (Does it have enough compute and memory?)
How long will it take? (What are the expected latency and throughput?)
How much does it cost? (Initial purchase, ongoing power, or cloud rental fees.)
FLOP/s, memory size, and bandwidth are key to answering the first two questions, while cost is usually more straightforward, though it includes power and cooling for on-premise solutions.
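As a rough illustration of the first question, here's a back-of-the-envelope memory check; the 20% overhead factor and every number below are assumptions for illustration only:

```python
# "Can it run the workload?" sketch: model weights + KV cache vs. accelerator memory.
def fits_in_memory(params_billion, bytes_per_param, kv_cache_gb, gpu_memory_gb, overhead=1.2):
    weights_gb = params_billion * bytes_per_param           # parameters (B) x bytes/param -> GB
    needed_gb = (weights_gb + kv_cache_gb) * overhead       # ~20% headroom for activations, buffers
    return needed_gb, needed_gb <= gpu_memory_gb

needed, ok = fits_in_memory(params_billion=13, bytes_per_param=2, kv_cache_gb=54, gpu_memory_gb=80)
print(f"need ~{needed:.0f} GB on an 80 GB card -> fits: {ok}")   # the batch-32 KV cache example does not fit
```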
Intuitive Explanations
Compute-Bound vs. Memory-Bound: The Master Chef Analogy
Imagine a master chef (the Processor/GPU Core) in a massive kitchen.
Compute-Bound Task (e.g., Complex Sauce Reduction): The chef is furiously chopping, mixing, and tasting. The recipe is complex and requires immense skill and speed. The limiting factor is the chef's own speed. Kitchen assistants bringing ingredients are waiting on the chef. This is like the prefill phase of an LLM, where massive parallel matrix multiplications max out the GPU's computational power.
Memory-Bound Task (e.g., Assembling a Simple Salad): The recipe is simple: grab lettuce, tomatoes, and dressing. The chef can assemble it instantly but has to wait for an assistant to run to the pantry (HBM Memory) and back for each ingredient. The limiting factor is the assistant's speed (the memory bandwidth). The chef is mostly idle, waiting for data. This is like the decode phase of an LLM, where for each new token, the huge model weights must be read from memory.
Figure 5: The initial processing of the prompt ("Prefill") is a parallel, compute-intensive task. The subsequent generation of each token ("Decode") is a sequential, memory-intensive task.
TTFT vs. TPOT: The User Experience of Waiting
TTFT (Time to First Token) is like asking a question and waiting for the first word of the answer. A low TTFT makes an application feel responsive and "alive." For a chatbot, this is crucial.
TPOT (Time Per Output Token) is the speed at which the rest of the answer is typed out. As long as it's faster than human reading speed (around 6-8 tokens/sec), the experience feels smooth. A very fast TPOT might not be noticeable, but a slow one is frustrating.
Figure 6: This illustrates "goodput." Even if a system processes many requests, only those meeting latency SLOs (e.g., TTFT < 200ms, TPOT < 100ms) count towards goodput. The dark green bars represent requests that failed the SLO. If an inference service can complete 10 RPS but only 3 satisfy the SLO, then its goodput is 3 RPS.
Hardware Memory Hierarchy: The Researcher's Desk
Think of a researcher working on a project. Their access to information has different speeds and capacities.
GPU SRAM (On-Chip Cache): This is the researcher's own brain and the sticky notes right in front of them. Blazingly fast access (10+ TB/s) but very limited space (tens of MB).
GPU HBM (High-Bandwidth Memory): These are the books and papers laid out on their desk. Fast to grab (1.5 TB/s) and holds a decent amount (40-80 GB). This is where the model weights live.
CPU DRAM (System Memory): This is the library down the hall. Huge capacity (up to 1 TB+) but slow to access (25 GB/s). You only go there when you absolutely have to.
Figure 7: The memory hierarchy shows a trade-off: the fastest memory (SRAM) has the smallest capacity, while the largest memory (DRAM) is the slowest. Effective optimization is about keeping the most needed data in the fastest possible tier.
Examples & Techniques Deep Dive
1. Model-Level Optimization
This is like making the arrow itself more aerodynamic.
Model Compression:
Quantization: Using fewer bits per weight (e.g., FP16 -> INT8). It's like writing your book with simpler words to make it shorter. The most popular and effective technique (a minimal sketch follows this list).
Pruning: Setting less important weights to zero, creating a "sparse" model. Like redacting non-essential sentences from a book. Harder to get right and requires hardware support for sparsity to be effective.
Distillation: Training a smaller "student" model to mimic a larger "teacher" model. Like creating a concise summary of a dense textbook.
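Here's the quantization sketch mentioned above: a minimal symmetric per-tensor INT8 scheme in NumPy. It's far simpler than production methods such as GPTQ or AWQ, but it shows the memory saving:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: store int8 weights plus one FP32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print(f"{w.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")          # 4x smaller than FP32
print("max abs error:", np.abs(w - dequantize(q, scale)).max())       # small reconstruction error
```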
Overcoming Autoregressive Decoding: This addresses the "one token at a time" bottleneck.
Speculative Decoding: A "fast but dumb" draft model generates several tokens. The "slow but smart" target model verifies them all in one parallel step (a toy sketch appears below, after Figure 8).
- Analogy: An eager junior associate (draft model) writes a paragraph. The senior partner (target model) quickly skims and approves it, which is much faster than writing it from scratch.
Parallel Decoding (e.g., Medusa): Using multiple "prediction heads" to guess several future tokens at once, then verifying and integrating them.
- Analogy: Instead of writing a sentence word-by-word, you jot down several possible next words simultaneously and then pick the best sequence. This requires modifying the model architecture.
Figure 8: Medusa uses extra "heads" to predict multiple future tokens in parallel. These predictions form a tree of possibilities, and the best path is chosen in a single step, accelerating generation. Each head predicts several options for a token position. The most promising sequence from these options is selected.
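Here's the toy speculative-decoding sketch promised above: a greedy draft-and-verify loop. It's a simplification; real implementations verify all draft positions in a single batched forward pass and handle sampling rather than pure greedy decoding. The draft_next and target_next callables are hypothetical stand-ins:

```python
import random

def speculative_decode(prompt_tokens, draft_next, target_next, k=4, max_new_tokens=16):
    """Toy greedy speculative decoding with a draft model and a target model."""
    tokens = list(prompt_tokens)
    target_len = len(prompt_tokens) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Accept draft tokens only while they match what the target would have chosen.
        #    (In a real system these target predictions come from ONE parallel forward pass;
        #    they are written as separate calls here for clarity.)
        accepted = 0
        for i, tok in enumerate(draft):
            if target_next(tokens + draft[:i]) == tok:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3. Append one token from the target (the correction on a mismatch, or a bonus token),
        #    so the result is exactly what greedy decoding with the target alone would produce.
        tokens.append(target_next(tokens))
    return tokens[:target_len]

# Demo with toy "models": the target predicts (last token + 1) % 100; the draft agrees ~80% of the time.
random.seed(0)
def target_next(seq): return (seq[-1] + 1) % 100
def draft_next(seq): return target_next(seq) if random.random() < 0.8 else 0

print(speculative_decode([1, 2, 3], draft_next, target_next))   # [1, 2, 3, 4, 5, 6, ...]
```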
Attention Mechanism Optimization: The attention mechanism's cost grows quadratically with sequence length, and the KV Cache is its biggest memory hog.
KV Cache: A brilliant hack to avoid re-calculating attention for past tokens. But for long contexts, it can become larger than the model itself! NOTE: A KV cache is used only during inference, not training. During training, because all tokens in a sequence are known in advance, next token generation can be computed all at once instead of sequentially, as during inference. Therefore, there’s no need for a KV cache.
Redesigning Attention:
- Multi-Query/Grouped-Query Attention (MQA/GQA): Instead of each attention head having its own K and V vectors, multiple heads share them. This dramatically shrinks the KV cache with minimal quality loss. A key innovation in models like Llama 2/3 (a back-of-the-envelope sketch follows Figure 9).
Kernel Fusion (e.g., FlashAttention): A low-level software optimization. Instead of performing the many steps of attention (Matmul, Mask, Softmax, Dropout, Matmul) by loading data from memory each time, FlashAttention "fuses" them into one GPU kernel. This minimizes slow memory I/O and keeps the computation on the fast on-chip SRAM.
Figure 9: Standard PyTorch attention involves multiple separate operations, each requiring a slow round-trip to GPU memory. FlashAttention fuses these into a single, highly optimized kernel, drastically reducing execution time.
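And here is the back-of-the-envelope MQA/GQA sketch referenced above. The shapes (80 layers, 64 query heads, 8 KV heads, head dimension 128) are assumed Llama-2-70B-like values, not figures from the text:

```python
def kv_cache_gb(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Keys + values: 2 x batch x seq x layers x (kv_heads x head_dim) x bytes
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value / 1e9

mha = kv_cache_gb(batch=8, seq_len=4096, n_layers=80, n_kv_heads=64, head_dim=128)  # every head keeps its own K/V
gqa = kv_cache_gb(batch=8, seq_len=4096, n_layers=80, n_kv_heads=8, head_dim=128)   # 8 shared KV heads
print(f"MHA: {mha:.0f} GB vs. GQA: {gqa:.0f} GB ({mha / gqa:.0f}x smaller)")
```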
2. Service-Level Optimization
This is like improving the archer's entire shooting process.
Batching: Grouping multiple user requests to process them together, improving hardware utilization and throughput.
Static Batching: "The bus doesn't leave until it's full." Simple, but can lead to high latency for the first requests.
Dynamic Batching: "The bus leaves when it's full OR every 10 minutes." A good balance between utilization and latency.
Continuous Batching (In-flight batching): The "magic" behind systems like vLLM. "As soon as a passenger gets off, a new one can get on." When a request in a batch finishes, a new request from the queue immediately takes its slot. This maximizes throughput without penalizing requests that generate short responses.
Figure 10: In Normal Batching, the entire batch must wait for the longest request (R7) to finish. In Continuous Batching, requests (R1, R2, R3, R4) are processed independently, and new requests (R5, R6) can start as soon as slots open up, dramatically improving efficiency.
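A toy simulation of continuous batching (request lengths and the slot count are made up) shows how a freed slot is refilled on the very next decode step:

```python
import collections

def continuous_batching(request_lengths, num_slots=4):
    """Each active request emits one token per decode step; a finished request frees its slot immediately."""
    queue = collections.deque(enumerate(request_lengths))    # (request_id, tokens_to_generate)
    slots, finished_at, step = {}, {}, 0
    while queue or slots:
        while queue and len(slots) < num_slots:              # refill free slots before the next step
            rid, remaining = queue.popleft()
            slots[rid] = remaining
        step += 1                                            # one decode step for every active request
        for rid in list(slots):
            slots[rid] -= 1
            if slots[rid] == 0:
                finished_at[rid] = step                      # slot is freed right away
                del slots[rid]
    return finished_at

print(continuous_batching([3, 10, 4, 6, 2, 5]))   # short requests finish without waiting for the 10-token request
```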
Decoupling Prefill and Decode: Since prefill is compute-bound and decode is memory-bound, using different hardware (or different ratios of hardware) for each step prevents them from competing for the same resources. For example, you might use a few powerful GPUs for the compute-heavy prefill and many cheaper GPUs with high-bandwidth memory for the decode step.
Prompt Caching: If many requests share a common prefix (like a long system prompt or a document for RAG), the processed state (the KV cache) for that prefix can be cached and reused. This turns a massive prefill operation into a tiny one for subsequent requests.
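A minimal sketch of the idea: cache the prefill state keyed by a hash of the shared prefix. The run_prefill callable below is a hypothetical stand-in for the real prefill computation:

```python
import hashlib

prefix_cache: dict[str, object] = {}

def prefill_with_cache(system_prompt: str, user_prompt: str, run_prefill):
    """Reuse the cached prefill state for a shared prefix; only the user-specific suffix is new work."""
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = run_prefill(system_prompt)                 # expensive, done once per prefix
    return run_prefill(user_prompt, past_state=prefix_cache[key])      # cheap, only the new tokens

# Demo with a trivial stand-in that just counts processed characters as its "state".
def fake_prefill(text, past_state=None):
    return (past_state or 0) + len(text)

shared = "You are a helpful assistant. " * 100
print(prefill_with_cache(shared, "What is MBU?", fake_prefill))
print(prefill_with_cache(shared, "Define goodput.", fake_prefill))     # prefix state reused from the cache
```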
Parallelism:
Replica Parallelism: This is the most straightforward strategy to implement. It simply creates multiple replicas of the model you want to serve. More replicas allow you to handle more requests at the same time, potentially at the cost of using more chips.
Model Parallelism: Often, your model is so big that it can’t fit into one machine. Model parallelism refers to the practice of splitting the same model across multiple machines. Fitting models onto chips can become an even more complicated problem with model parallelism; tensor and pipeline parallelism below are the most common forms.
Tensor Parallelism (also known as intra-operator parallelism): Splitting a single large operation (like a matrix multiplication) across multiple GPUs. This is essential for running models that are too big for one GPU and also reduces latency (a minimal sketch follows this list).
Pipeline Parallelism: Splitting the model layers across multiple GPUs.
GPU 1 handles layers 1-10, GPU 2 handles layers 11-20, etc. This is great for throughput but can increase latency due to communication overhead between GPUs.
FSDP (Fully Sharded Data Parallel): A memory-efficient training strategy that shards model weights, gradients, and optimizer states across multiple GPUs. Unlike traditional data parallelism (where each GPU stores a full model copy), FSDP ensures that no single GPU holds the entire model at once. This allows you to train larger models on fewer GPUs by minimizing memory overhead and reducing redundancy in communication.
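Finally, the tensor-parallelism sketch referenced above: a miniature NumPy example that splits one matrix multiplication column-wise across simulated "devices" and gathers the shards. In a real system each shard lives on a different GPU and the gather is a collective operation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))        # activations: batch x hidden
W = rng.standard_normal((512, 2048))     # one large weight matrix

num_devices = 4
shards = np.split(W, num_devices, axis=1)            # each "device" holds a 512 x 512 column shard
partials = [x @ shard for shard in shards]           # computed in parallel on real hardware
y_parallel = np.concatenate(partials, axis=1)        # all-gather of the column shards

assert np.allclose(y_parallel, x @ W)                # identical result to the single-device matmul
```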
Insights & Relationships
The Bottleneck Defines the Solution: A memory-bound problem (decoding) won't be solved by a more powerful processor (more FLOP/s). It needs higher memory bandwidth or techniques that reduce memory traffic (like FlashAttention and KV cache quantization).
Latency vs. Throughput Trade-off: Almost every optimization forces a choice. Batching increases throughput (lower cost per request) but can increase latency. You must optimize for the metric that matters most to your users.
Hardware and Software Co-design: The most advanced optimizations (like FlashAttention) are born from a deep understanding of the hardware architecture (memory hierarchy, compute units). This is why companies like NVIDIA, Google, and OpenAI develop software (CUDA, Triton, XLA) alongside their hardware.
The Rise of Inference-Specific Solutions: As training becomes more centralized, the real-world cost of AI is shifting to inference. This drives innovation in inference-specific hardware (e.g., AWS Inferentia), architectures (MQA/GQA), and serving systems (vLLM, TGI).
Quantization is the Low-Hanging Fruit: It is the easiest, most reliable, and often most impactful optimization. An 8-bit quantized model uses half the memory and bandwidth of its 16-bit version, often with negligible quality loss.
Figure 11: This chart shows the compounding effect of different optimization techniques on a Llama-7B model. Each step—compiling, quantizing, and adding speculative decoding—provides a significant boost in throughput (tok/s/user).
Practice Questions / Flashcards
Q: Why is the LLM prefill phase typically compute-bound?
- A: Because it processes all input tokens simultaneously in large, parallel matrix multiplications, which maxes out the GPU's computational capacity (FLOP/s).
Q: Why is the LLM decode phase typically memory bandwidth-bound?
- A: Because for each token, it must load the entire set of model weights from slow HBM memory. The computation per token is small, so the bottleneck is data movement speed, not calculation speed.
Q: What is the key difference between nvidia-smi's "GPU Utilization" and MFU (Model FLOP/s Utilization)?
- A: nvidia-smi utilization only shows if the GPU is active (busy), not if it's being used efficiently. A GPU can be 100% busy but only performing 1% of its peak FLOP/s. MFU measures the actual efficiency by comparing achieved FLOP/s to the theoretical maximum.
Q: How does Speculative Decoding speed up inference without changing the final output of the target model?
- A: It uses a smaller, faster "draft" model to generate candidate tokens, which are then verified in a single, parallel pass by the larger, accurate "target" model. Since verification is parallelizable and faster than sequential generation, this accelerates the process. The final output is always what the target model would have produced.
Q: What is the primary purpose of the KV Cache, and what is its main drawback?
- A: Its purpose is to store the key/value vectors from the attention mechanism for all previous tokens, avoiding expensive re-computation at each new step. Its main drawback is its massive memory consumption, which grows linearly with sequence length and batch size and can become the main bottleneck for long-context applications.
Q: Explain the difference between Tensor Parallelism and Pipeline Parallelism.
- A: Tensor Parallelism splits a single operation (like a matrix multiplication) across multiple devices, reducing latency. Pipeline Parallelism splits the model's layers across devices, with each device handling a different stage. This increases throughput but adds latency due to inter-device communication.
Q: Your chatbot feels slow to start answering but then generates text quickly. Which metric would you focus on improving: TTFT or TPOT?
- A: You would focus on improving TTFT (Time to First Token), as this metric governs the initial response latency that the user perceives as "slowness to start."
References/Further Reading
Roofline Model: Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for multicore architectures.
FlashAttention: Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
PagedAttention (vLLM): Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention.
Speculative Decoding: Chen, X., et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling.
Continuous Batching (Orca): Yu, G., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models.
Medusa: Cai, T., et al. (2024). Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads.
Written by Gasym A. Valiyev
Intelligent Systems Engineer specializing in Robotics & AI. Expertise in ML/DL, Python/C++, LLMs, RAG, and end-to-end intelligent systems development. Passionate about solving complex problems with AI.