How Does vLLM Do It?

Deploying Large Language Models like Gemma, Llama, and Mistral into production systems brings a host of engineering challenges, mainly around latency, throughput, and memory efficiency. As models grow larger and user demand increases, traditional serving approaches, such as running inference directly through the Transformers library, often buckle under the pressure. So in production, we engineers often reach for vLLM, an open-source library designed specifically to optimize LLM inference serving. Its effectiveness comes largely from two core innovations: PagedAttention and Continuous Batching. I’ll try explaining these techniques in my own way, exploring the problems they solve, their underlying mechanisms, and the impact they create together on LLM serving performance.

The Main Problem: The KV Cache Bottleneck

Before understanding vLLM's solutions, we must first understand the main performance bottleneck in autoregressive LLM inference: the Key-Value (KV) Cache.

Autoregressive Generation and the KV Cache:

LLMs generate text token by token. At each step, the model takes the sequence generated so far (the prompt plus previously generated tokens) and predicts the next token. The most computation-heavy part of this is the attention mechanism (specifically, self-attention in the decoder).

The attention mechanism computes attention scores between tokens: a token at position i attends to all tokens at positions 0...i. This calculation involves Query (Q), Key (K), and Value (V) vectors derived from the token embeddings. Crucially, the K and V vectors of tokens 0...i-1 are needed again when computing the attention output for token i, and recomputing them for all previous tokens at every generation step would be prohibitively expensive.

The KV Cache is the optimization that stores these K and V vectors once they are computed. When generating the token at position i, we only need to compute the Q, K, and V vectors for the most recently processed token (position i-1) and fetch the K and V vectors for tokens 0...i-2 from the cache. A minimal sketch of this decode loop is shown below.
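
To make the caching pattern concrete, here is a minimal, framework-free sketch of greedy decoding with a KV cache. It assumes a single attention head and uses hypothetical helpers (embed, project_qkv, to_logits) as stand-ins for the real model layers; only the caching behavior matters here.

```python
import numpy as np

def attend(q, K, V):
    # q: (d,); K, V: (t, d) for all tokens seen so far.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over past positions
    return weights @ V                       # (d,) context vector

def decode_step(tok, K_cache, V_cache, embed, project_qkv, to_logits):
    # Q, K, V are computed only for the newest token; older K/V come from the cache.
    q, k, v = project_qkv(embed(tok))
    K_cache.append(k)
    V_cache.append(v)
    ctx = attend(q, np.stack(K_cache), np.stack(V_cache))
    return int(np.argmax(to_logits(ctx)))    # greedy next-token choice

def generate(prompt_ids, num_new_tokens, embed, project_qkv, to_logits):
    K_cache, V_cache = [], []
    # Prefill: compute and cache K/V for the prompt once.
    for tok in prompt_ids[:-1]:
        _, k, v = project_qkv(embed(tok))
        K_cache.append(k)
        V_cache.append(v)
    ids = list(prompt_ids)
    # Decode: each new token costs one cheap step instead of reprocessing everything.
    for _ in range(num_new_tokens):
        ids.append(decode_step(ids[-1], K_cache, V_cache,
                               embed, project_qkv, to_logits))
    return ids
```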

KV Cache Characteristics and Challenges:

  1. Memory Consumption: The KV cache is enormous. For a model like Llama-7B, the cache size for a single sequence of length L can be estimated as 2 * num_layers * num_heads * head_dim * L * precision_bytes (the leading 2 accounts for both K and V). This easily reaches gigabytes per sequence for long contexts; a worked example follows this list.

  2. Dynamic Growth: The cache for a given request grows with the length of the generated sequence. This dynamic nature makes memory management difficult.

  3. Lifespan: The cache is needed only for the duration of a single request's generation process.
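
To put item 1 in perspective, here is the formula evaluated with Llama-7B's configuration (32 transformer layers, 32 attention heads, head dimension 128) stored in fp16; the constants are Llama-7B's, but the function works for any dense-attention model.

```python
# KV-cache size estimate for one sequence (Llama-7B, fp16).
num_layers, num_heads, head_dim, precision_bytes = 32, 32, 128, 2

def kv_cache_bytes(seq_len):
    # The factor of 2 covers one K and one V vector per token, per head, per layer.
    return 2 * num_layers * num_heads * head_dim * seq_len * precision_bytes

print(kv_cache_bytes(1) / 2**20)      # ~0.5 MiB per token
print(kv_cache_bytes(2048) / 2**30)   # ~1.0 GiB for a 2048-token sequence
```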

Traditional Memory Management Issues:

Most early LLM serving systems allocated GPU memory for the KV cache in contiguous blocks per sequence. This leads to some big problems:

  1. Internal Fragmentation: If memory is allocated upfront for the maximum possible sequence length, but a request generates a shorter sequence, the unused portion of the allocated block is wasted (internal fragmentation). The short calculation after this list shows how severe this can be.

  2. External Fragmentation: Due to the dynamic allocation and deallocation of variable-sized cache blocks for different requests, the GPU memory becomes fragmented over time. Even if there's enough total free memory, it might be scattered in small, non-contiguous chunks, preventing the allocation of a large enough block for a new request (external fragmentation). This highly limits the number of concurrent requests the system can handle.

  3. High Overhead: Frequent allocation/deallocation of large memory blocks incurs a lot of overhead.

  4. Copying Penalties: Operations like beam search, where many candidate sequences share a common prefix, often require copying the entire KV cache for the shared prefix when sequences diverge, leading to redundant memory usage and copying costs.
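
To get a sense of how costly internal fragmentation is, consider a hypothetical server that reserves contiguous cache space for a 2048-token maximum per request, using the per-token estimate from the Llama-7B calculation above:

```python
# Illustrative internal-fragmentation math for contiguous, max-length preallocation.
bytes_per_token = 524_288      # ~0.5 MiB/token, from the Llama-7B estimate above
max_seq_len = 2048             # space reserved up front for every request
actual_len = 256               # tokens this request actually produced

reserved = max_seq_len * bytes_per_token
used = actual_len * bytes_per_token
wasted = reserved - used
print(f"wasted: {wasted / 2**20:.0f} MiB ({100 * wasted / reserved:.0f}% of the reservation)")
# -> wasted: 896 MiB (88% of the reservation)
```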

Continuous Batching: Maximizing GPU Utilization

Traditionally, most LLM serving has used static batching: requests are grouped into a batch, padded to the length of the longest sequence in the batch, and processed together. The system waits for all sequences in the batch to finish generation before starting the next batch.

Limitations of Static Batching:

  1. Head-of-Line Blocking: Fast-completing requests in a batch are stuck waiting for the slowest request to finish.

  2. Wasted Computation: A lot of computation is wasted on padding tokens.

  3. Low GPU Utilization: The GPU often sits idle when it is waiting for a full batch to assemble or for the current batch to complete.

Continuous Batching (also known as iteration-level scheduling, and sometimes loosely called dynamic batching):

vLLM implements Continuous Batching to address these issues. The main idea is to operate at the granularity of individual generation steps rather than entire sequences:

  1. Request Queue: Incoming requests are placed in a waiting queue.

  2. Dynamic Batch Composition: At each forward pass (generation step), the scheduler checks the queue and adds new requests to the currently running batch if there is capacity (GPU memory, compute).

  3. Step-wise Execution: The GPU performs one decoding step for all sequences currently in the batch.

  4. Completion Handling: If a sequence finishes generation (e.g., outputs an EOS token), its resources are immediately freed, and its slot in the batch can then be filled by a new request in the next step.

  5. No Waiting: The system doesn't wait for the entire batch to finish; it continuously processes steps for the active sequences and takes in new requests as soon as resources allow (a simplified scheduler loop is sketched below).
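
Here is a deliberately simplified sketch of this iteration-level loop. Request, model_step, and max_batch_size are hypothetical placeholders; vLLM's real scheduler additionally tracks free KV-cache blocks, handles preemption, and distinguishes prefill from decode.

```python
from collections import deque

def serve(waiting: deque, model_step, max_batch_size: int):
    running = []
    while waiting or running:
        # Dynamic batch composition: admit new requests whenever there is
        # capacity, instead of waiting for the current batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # Step-wise execution: one decode step for every active sequence.
        new_tokens = model_step(running)

        # Completion handling: finished sequences free their slot immediately,
        # so a queued request can take it on the very next step.
        still_running = []
        for req, tok in zip(running, new_tokens):
            req.append_token(tok)
            if req.is_finished():          # e.g. emitted EOS or hit max length
                req.complete()
            else:
                still_running.append(req)
        running = still_running
```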

Benefits of Continuous Batching:

  • Increased Throughput: By keeping the GPU constantly busy with active sequences and minimizing idle time, throughput improves substantially.

  • Reduced Average Latency: Requests start processing sooner and finish faster as they don't wait for slower requests in a static batch.

  • Better GPU Utilization: The GPU performs useful work more consistently.

However, Continuous Batching makes the memory management problem worse. The dynamic addition and removal of sequences with varying lengths make efficient, non-fragmented memory allocation even more difficult. This is where PagedAttention comes in.

PagedAttention: Virtual Memory for the KV Cache

PagedAttention, inspired by virtual memory and paging techniques in operating systems, fundamentally changes how the KV cache memory is managed.

The Core Idea: Non-Contiguous Blocks

Instead of allocating a large, contiguous memory block for each sequence's KV cache, PagedAttention divides the cache into smaller, fixed-size blocks.

  1. Physical Blocks: The total GPU memory dedicated to the KV cache is divided into many physical blocks, each holding the K and V vectors for a fixed number of tokens (16 tokens per block by default in vLLM).

  2. Logical Blocks: Each sequence's KV cache is represented as a sequence of logical blocks.

  3. Block Table: A mapping mechanism, just like a page table in an OS, maps the logical blocks of a sequence to the physical blocks in GPU memory. Most importantly, these physical blocks do not need to be contiguous.

How PagedAttention Works:

  1. Allocation: When a new token is generated for a sequence, its K and V vectors need to be stored. The vLLM scheduler/memory manager checks whether the sequence's last logical block still has free slots.

    • If yes: The new KV vectors are written into the corresponding physical block.

    • If no: A new physical block is allocated from a pool of free blocks, the block table for the sequence is updated to map the next logical block to this new physical block, and the KV vectors are written there.

  2. Attention Calculation: During the attention computation for a sequence, the GPU kernel uses the sequence's block table to fetch the required K and V vectors from the appropriate (potentially non-contiguous) physical blocks. This requires modified attention kernels that handle the indirect addressing through the block table.

  3. Deallocation: When a sequence finishes, its physical blocks are released back to the free pool by simply updating their status, so no large memory deallocation call is needed. (A toy block-manager sketch follows this list.)
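
The toy block manager below is written from the description above, not from vLLM's source, to show the core bookkeeping: fixed-size blocks, a per-sequence block table, and allocation/free as simple list operations. BLOCK_SIZE, BlockManager, and append_token are names invented for this sketch.

```python
BLOCK_SIZE = 16  # tokens' worth of KV vectors per physical block

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored so far

    def append_token(self, seq_id: int):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:            # last block is full (or none exists yet)
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks: queue or preempt")
            # Any free block will do; physical contiguity is never required.
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        # The new token's K/V land in physical block table[-1], slot length % BLOCK_SIZE.
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id: int):
        # Deallocation is just handing the block ids back to the free pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```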

Solving Memory Fragmentation:

  • Eliminates External Fragmentation: Since blocks are fixed-size and allocated non-contiguously, the system can use any available free block. The problem of needing a large contiguous chunk disappears. This allows packing many more sequences into the same amount of memory.

  • Minimizes Internal Fragmentation: Internal fragmentation is limited to the last block allocated for each sequence. Since block sizes are small relative to the total cache size, this waste is dramatically reduced compared to reserving space for the maximum sequence length (see the quick calculation below).
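
A quick calculation shows how tight this bound is: with 16-token blocks and the ~0.5 MiB/token Llama-7B estimate used earlier, each sequence can waste at most the unused slots of its final block.

```python
BLOCK_SIZE = 16
bytes_per_token = 524_288                     # Llama-7B estimate from earlier
max_waste = (BLOCK_SIZE - 1) * bytes_per_token
print(f"worst-case waste per sequence: {max_waste / 2**20:.1f} MiB")
# -> 7.5 MiB, versus ~896 MiB in the max-length preallocation example above
```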

Enabling Efficient Memory Sharing:

PagedAttention unlocks almost zero-cost memory sharing for sequences with common prefixes. This is especially powerful for:

  1. Parallel Sampling: Generating multiple outputs from the same prompt (e.g., n > 1). All n sequences share the same prompt KV cache. With PagedAttention, their block tables initially point to the same physical blocks for the prompt section.

  2. Beam Search: Multiple candidate beams share a common history. Similar to parallel sampling, the block tables for different beams can point to the same physical blocks for the shared prefix.

  3. Copy-on-Write: When shared sequences diverge (e.g., different tokens are generated in parallel sampling or beam search), only the new blocks need to be allocated separately. If a sequence needs to write into a block that is still shared (for example, the last, partially filled block of a shared prompt), PagedAttention applies copy-on-write: allocate a new block, copy the contents of the original shared block, make the write there, and update the diverging sequence's block table to point to the new block. This avoids copying the entire shared-prefix cache, greatly cutting the memory overhead and latency of these techniques (a sketch with reference counting follows).
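
To show how block tables make sharing cheap, here is a toy extension of the earlier block-manager sketch with reference counting and copy-on-write. SharedBlockPool, fork, and write are invented names; vLLM's actual bookkeeping is more involved, but the principle is the same.

```python
class SharedBlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = {}                      # physical block id -> #sequences using it

    def fork(self, parent_table: list) -> list:
        # Parallel sampling / beam search: the child reuses the parent's physical
        # blocks, so the shared prefix costs no additional KV-cache memory.
        for block in parent_table:
            self.ref_count[block] = self.ref_count.get(block, 1) + 1
        return list(parent_table)                # copies the table, not the KV data

    def write(self, table: list, logical_idx: int) -> int:
        block = table[logical_idx]
        if self.ref_count.get(block, 1) > 1:
            # Copy-on-write: the block is shared, so this sequence gets its own
            # copy before writing, leaving the other sequences untouched.
            new_block = self.free.pop()
            # ...copy the KV contents of `block` into `new_block` on the GPU...
            self.ref_count[block] -= 1
            self.ref_count[new_block] = 1
            table[logical_idx] = new_block
            return new_block
        return block                             # sole owner: write in place
```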

Combination of PagedAttention + Continuous Batching

The true power of vLLM comes from how PagedAttention and Continuous Batching work together:

  • Flexible Memory for Dynamic Batches: PagedAttention provides the fine-grained, flexible memory management required by Continuous Batching. As sequences of varying lengths enter and leave the batch dynamically, PagedAttention efficiently allocates and frees the small blocks needed for each step without causing fragmentation.

  • Higher Concurrency: By removing fragmentation and enabling efficient sharing, PagedAttention allows many more sequences (and thus more total context) to fit into the same amount of GPU memory. This directly translates into larger effective batch sizes for Continuous Batching, further improving throughput.

  • Reduced Overhead: Allocating/freeing small, fixed-size blocks is much faster than managing large, variable-sized contiguous chunks. This low overhead is very important for the fast, step-by-step adjustments made by Continuous Batching.

Implementation Details and Considerations

  • Block Size: Choosing the block size involves a trade-off. Smaller blocks reduce internal fragmentation but increase the overhead of the block table (more entries) and can reduce memory-access locality. Larger blocks reduce table overhead but increase potential waste in the last block (the comparison after this list illustrates this).

  • Attention Kernels: Custom attention kernels are required that can read the block tables and gather KV vectors from non-contiguous physical blocks. vLLM includes highly optimized CUDA kernels for this purpose.

  • Scheduler: The vLLM scheduler becomes more complex. It must manage the allocation/deallocation of blocks, track free blocks, and make scheduling decisions based not just on compute availability but also on fine-grained memory availability (number of free blocks).

  • Memory Manager: A central component responsible for managing the pool of physical blocks and handling allocation/free requests.
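
To give a rough feel for the block-size trade-off mentioned in the first bullet, the snippet below compares block-table length and worst-case last-block waste for a 4096-token sequence, reusing the earlier Llama-7B per-token estimate; the numbers are purely illustrative.

```python
bytes_per_token = 524_288      # ~0.5 MiB/token (Llama-7B, fp16)
seq_len = 4096

for block_size in (8, 16, 32, 128):
    table_entries = -(-seq_len // block_size)    # ceil(seq_len / block_size)
    worst_waste_mib = (block_size - 1) * bytes_per_token / 2**20
    print(f"block_size={block_size:>3}: {table_entries:>4} block-table entries, "
          f"up to {worst_waste_mib:.1f} MiB wasted in the last block")
```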

A Very Big Shift in LLM Serving

vLLM, powered by PagedAttention and Continuous Batching, was a big step in LLM inference optimization. By tackling the critical KV cache memory management problem directly with techniques inspired by operating systems, PagedAttention removes memory fragmentation and enables efficient sharing. Continuous Batching takes advantage of this efficient memory system to maximize GPU utilization and minimize latency by dynamically managing requests at the iteration level. Together, these innovations allow vLLM to achieve state-of-the-art throughput and handle larger batch sizes and longer sequences compared to systems like HF transformers, all while using GPU memory more effectively.
