Faster Inference in Large Language Models (LLMs)


Large Language Models (LLMs), such as OpenAI’s GPT, Meta’s LLaMA, and Google’s PaLM, have revolutionized natural language understanding and generation. These models power a wide range of applications from virtual assistants and real-time chatbots to code generation tools, document summarizers, and research co-pilots. However, their deployment in production systems comes with a significant challenge: latency.
The Problem: Latency and Resource Bottlenecks
Despite their impressive performance, LLMs are computationally expensive. Inference, the act of generating output text from a given prompt, is particularly slow for several reasons:
Autoregressive generation: LLMs generate one token at a time, making it difficult to parallelize the decoding process.
High parameter count: Models like GPT-3 and LLaMA-65B contain billions of parameters, requiring massive compute resources.
Memory-intensive attention: Self-attention mechanisms have quadratic complexity with respect to sequence length.
Large prompt windows: Prompt engineering often involves feeding long contexts, increasing the per-token cost of inference.
All of these factors create friction for real-time applications, multi-user systems, and on-device deployments. Reducing inference latency without sacrificing model accuracy has become a critical area of research and engineering.
The Goal: Low-Latency, Scalable LLM Inference
Achieving low-latency and scalable LLM inference involves architectural improvements, model compression, software-level optimizations, and leveraging hardware acceleration. To be more specific, the goal of fast LLM inference is twofold:
Minimize response time per token (latency)
Maximize throughput across users and tasks (efficiency & scalability)
Note: ‘Throughput’ here means the amount of data (tokens or requests) that can be processed within a given time period, regardless of the delay any individual request experiences. For example, a server generating 50 tokens per second for each of 10 concurrent users has a per-user latency of 20 ms per token, but an aggregate throughput of 500 tokens per second.
Techniques for Faster LLM Inference
Infrastructure-Based Techniques
1. Key-Value (KV) Caching
KV caching stores the attention keys and values of past tokens during autoregressive generation, so that attention for previous tokens does not have to be recomputed on every forward pass. For each new token, the model computes the key and value vectors at every attention layer and appends them to the cache; subsequent decoding steps reuse these cached vectors when attending to past tokens. The cache therefore grows linearly with sequence length: for a model with L layers and hidden dimension d, KV-cache memory scales as O(L × T × d), where T is the number of tokens processed. This linear growth becomes significant for long sequences, especially in models with many layers or large hidden dimensions, which is why memory-optimization strategies such as paged KV caches are needed for practical deployment.
Tools/Uses:
Hugging Face transformers with use_cache=True
vLLM (uses paged KV cache)
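For example, with Hugging Face transformers the cache is passed back into the model on every step via use_cache=True. Below is a minimal sketch of a greedy decoding loop that reuses the cache; the model name and prompt are placeholders, and production code would normally just call model.generate(), which manages the cache internally.

```python
# Minimal sketch: reusing the KV cache across decoding steps with Hugging Face transformers.
# The model name and prompt are placeholders; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tokenizer("Faster LLM inference is", return_tensors="pt")
generated = inputs["input_ids"]
past_key_values = None  # the KV cache, filled on the first forward pass

with torch.no_grad():
    for _ in range(20):
        # With a cache, only the newest token needs to be fed to the model.
        step_input = generated if past_key_values is None else generated[:, -1:]
        outputs = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values  # grows by one token per layer each step
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```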
2. Speculative Decoding
Speculative decoding (also called token speculation) is an inference optimization technique in which a smaller, faster "draft" model proposes several candidate tokens ahead, and a larger, more accurate target model then verifies them in a single forward pass. This reduces the latency of traditional token-by-token decoding by speculatively executing likely continuations. During each round, the draft model proposes a short sequence of tokens, and the verifier model checks whether those tokens match what it would have generated itself. Accepted tokens are kept and generation continues; at the first mismatch, the system falls back to the target model's own prediction and starts a new round from the last verified token. Because the verification pass runs in parallel over the proposed tokens, this multi-model setup enables significant speed-ups by combining the speed of the smaller model with the accuracy of the larger one.
Tools/Uses:
NVIDIA’s TensorRT-LLM
Hugging Face Transformers (assisted generation via assistant_model)
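As a concrete illustration, Hugging Face Transformers exposes this idea as "assisted generation": a small draft model is passed through the assistant_model argument of generate(). The sketch below uses two GPT-2 checkpoints purely as placeholders; the draft model must share the target model's tokenizer, and the achievable speed-up depends on how often the draft's proposals are accepted.

```python
# Minimal sketch of speculative (assisted) decoding with Hugging Face transformers.
# Model names are placeholders; the draft model must share the target model's vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "gpt2-large"  # larger, more accurate target model (placeholder)
draft_name = "gpt2"         # smaller, faster draft model (placeholder)

tokenizer = AutoTokenizer.from_pretrained(target_name)
target_model = AutoModelForCausalLM.from_pretrained(target_name)
draft_model = AutoModelForCausalLM.from_pretrained(draft_name)

inputs = tokenizer("Speculative decoding speeds up inference by", return_tensors="pt")

# The draft model proposes several tokens per round; the target model verifies them in one pass.
output_ids = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # enables assisted / speculative decoding
    max_new_tokens=40,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```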
3. Quantization
Quantization reduces the precision of a model's weights and activations (for example from FP32 to INT8, or from FP16 to INT4) to decrease memory usage and accelerate inference, particularly on CPUs and edge devices. There are two main approaches: Post-Training Quantization (PTQ), which quantizes a model after training, and Quantization-Aware Training (QAT), which simulates low-precision arithmetic during training to better preserve accuracy. Techniques such as GPTQ and AWQ (Activation-aware Weight Quantization) are designed specifically for large language models and enable aggressive quantization with minimal accuracy loss. Quantization can significantly reduce latency and resource consumption, but it usually costs some model accuracy, so the choice of method and bit-width is a trade-off between efficiency and output quality.
Tools/Uses:
GPTQ
AutoAWQ
Hugging Face bitsandbytes integration
Intel Neural Compressor / ONNX Runtime
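As an illustration, the sketch below loads a causal language model in 4-bit precision through the transformers integration with bitsandbytes. The checkpoint name is a placeholder, and a CUDA GPU with the bitsandbytes and accelerate packages installed is assumed.

```python
# Minimal sketch: loading a causal LM in 4-bit with bitsandbytes via transformers.
# Assumes a CUDA GPU plus the bitsandbytes and accelerate packages; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint works

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in FP16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```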
4. Model Compilation / Graph Optimization
Model compilation transforms high-level models into optimized kernels tailored for specific hardware, significantly improving inference speed and efficiency. By converting models using tools like TorchScript, ONNX, TensorRT, or TVM, this process bypasses Python runtime overhead, fuses multiple operations into a single kernel, and leverages hardware-specific instructions for acceleration. It involves converting dynamic computation graphs—where operations are defined at runtime—into static graphs, which allow more aggressive optimizations and better scheduling. Operator fusion and kernel-level acceleration are key techniques used in this process to reduce memory access and computational redundancy, resulting in faster and more resource-efficient execution, particularly on GPUs and specialized accelerators.
Tools:
NVIDIA TensorRT-LLM
TorchDynamo / TorchInductor
GGML / llama.cpp (CPU-focused, very fast)
DeepSpeed inference engine
OpenVINO (for Intel HW)
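A minimal sketch of graph compilation with torch.compile (the TorchDynamo + TorchInductor path) is shown below. The model name is a placeholder; the first call pays a one-time compilation cost, and the actual speed-up depends heavily on the hardware and model.

```python
# Minimal sketch: compiling a Hugging Face model's forward pass with torch.compile
# (TorchDynamo captures the graph, TorchInductor generates optimized kernels).
# The model name is a placeholder; speed-ups vary by hardware and model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Compile the forward pass; modes such as "reduce-overhead" or "max-autotune" can also be passed.
compiled_model = torch.compile(model)

inputs = tokenizer("Compiled inference is", return_tensors="pt")
with torch.no_grad():
    logits = compiled_model(**inputs).logits  # first call compiles, later calls reuse the graph
print(logits.shape)
```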
System-Level and Architectural Optimizations
5. FlashAttention
FlashAttention is a memory-efficient and CUDA-optimized implementation of the attention mechanism that significantly reduces memory bandwidth bottlenecks during transformer inference and training. Unlike standard attention implementations that materialize large intermediate matrices, FlashAttention leverages block-sparse and fused attention techniques to compute attention in a tiled, GPU-friendly manner—streaming blocks of queries, keys, and values directly through shared memory and registers. This dramatically cuts down memory overhead and accelerates performance, especially on long sequences. FlashAttention v1 introduced the core idea of avoiding unnecessary memory reads/writes, while FlashAttention v2 further improved performance by supporting more flexible configurations, better GPU utilization, and compatibility with newer architectures. Understanding these versions helps optimize large-scale models for both training and deployment on modern hardware.
Tools:
FlashAttention2
xFormers
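In practice, recent versions of Hugging Face transformers can swap in the FlashAttention-2 kernel at load time. The sketch below assumes the flash-attn package is installed, the GPU is Ampere-class or newer, and the model is loaded in FP16/BF16; the checkpoint is a placeholder, and not every architecture supports this flag.

```python
# Minimal sketch: enabling the FlashAttention-2 kernel when loading a model with transformers.
# Assumes the flash-attn package, a supported GPU (Ampere or newer), and half-precision weights.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",              # placeholder checkpoint
    torch_dtype=torch.bfloat16,               # FlashAttention requires FP16/BF16
    attn_implementation="flash_attention_2",  # swap the attention kernel
    device_map="auto",
)
```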
6. Rotary Position Embeddings (RoPE) and Linear Attention
Alternative positional embeddings are designed to improve a model's ability to generalize to longer sequences while maintaining efficiency during inference. Unlike traditional absolute or relative positional encodings, Rotary Positional Embeddings (RoPE) encode positions through rotations in the query and key space, enabling better extrapolation and more effective use of cached key-value pairs during autoregressive decoding. This leads to more robust performance on sequences longer than those seen during training. In parallel, linear attention approximations such as Performer and Longformer reduce the quadratic complexity of standard attention mechanisms, making long-sequence processing feasible by using kernel tricks or sparse patterns. Additionally, models like Mamba and other state-space models introduce fundamentally different architectures that model long-range dependencies with linear time and memory complexity, offering promising alternatives to traditional transformers for scaling to very long contexts.
Tools:
llama.cpp: Implements RoPE and allows positional scaling for extended contexts with minimal degradation.
xPos: A research project introducing extrapolatable position embeddings; it can be adapted into Hugging Face models.
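To make the idea concrete, the sketch below applies rotary embeddings to query and key tensors using the common split-half formulation found in LLaMA-style models. Tensor shapes and names are illustrative, and real implementations typically precompute and cache the cos/sin tables rather than rebuilding them on every call.

```python
# Minimal sketch of Rotary Position Embeddings (RoPE) applied to query/key tensors,
# using the split-half formulation. Shapes and names are illustrative.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to x of shape (batch, seq_len, heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.einsum("s,f->sf", positions, inv_freq)  # (seq_len, half)
    cos = angles.cos()[None, :, None, :]                   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: rotate random queries and keys before computing attention scores.
q = rope(torch.randn(1, 16, 8, 64))
k = rope(torch.randn(1, 16, 8, 64))
print(q.shape, k.shape)
```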
Together, these techniques represent a robust toolkit for building faster, leaner, and more scalable LLM systems. By strategically combining these methods, tailored to the deployment environment and application constraints, developers can unlock the full potential of LLMs in both cloud and edge settings, ensuring responsiveness, cost-efficiency, and broad accessibility in production-grade AI systems.