Novita AI Evaluates FlashMLA on H100 and H200


DeepSeek has officially kicked off its five-day open-source release initiative, with FlashMLA as the first featured project. FlashMLA is an optimized, high-efficiency MLA decoding kernel designed specifically for NVIDIA Hopper GPUs (e.g., the H800 SXM5), with the primary goal of accelerating inference for large-scale models on NVIDIA's high-end hardware.
As a leading provider of AI infrastructure, Novita AI was among the first to evaluate FlashMLA's performance across mainstream Hopper GPUs (H100, H200).
What is MLA?
Before diving into the evaluation results, let’s take a moment to understand some relevant background concepts.
Hopper GPU: NVIDIA's next-generation high-performance GPU architecture, engineered for AI and high-performance computing (HPC). Built with advanced process technologies and an innovative architecture, Hopper GPUs deliver exceptional performance and energy efficiency for complex computational tasks. The mainstream Hopper GPUs include H100 and H200.
Decoding Kernel: A hardware or software module specifically designed to accelerate decoding tasks. In AI inference, decoding kernels significantly enhance the speed and efficiency of model inference, particularly when processing sequential data.
Key-Value (KV) Pairs
Key:
A projected representation of each input token, used to compute attention weights (how much focus to place on different parts of the input).
Example: In text generation, keys help the model identify which words in a sentence are most relevant to the current word being generated.
Value:
Contains the actual information associated with each input token, weighted by the attention scores.
Example: Values store the semantic meaning of words, which are combined based on attention weights to produce the output.
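To make the roles of keys and values concrete, below is a minimal sketch of single-head scaled dot-product attention in PyTorch. The tensor sizes and variable names are illustrative only and are not tied to any particular model.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only: 8 tokens with 64-dimensional representations.
n_tokens, d_model = 8, 64
x = torch.randn(n_tokens, d_model)            # input token representations

# Learned projections produce queries, keys, and values from the same input.
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Keys decide WHERE to attend: query-key similarity becomes the attention
# weights after scaling and softmax.
weights = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)   # (n_tokens, n_tokens)

# Values carry WHAT is attended to: the output is an attention-weighted
# mixture of the value vectors.
output = weights @ v                                     # (n_tokens, d_model)
print(weights.shape, output.shape)
```

During decoding, the keys and values of all previously generated tokens are cached (the KV cache); that cache is precisely the memory that MLA compresses.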
MLA (Multi-head Latent Attention): A novel attention mechanism that requires lighter KV (key-value) caching, making it more scalable for long-sequence processing. MLA outperforms traditional Multi-Head Attention (MHA) mechanisms in both scalability and performance.
MHA vs. MQA vs. GQA vs. MLA

| Mechanism | Technical Logic | Inference Speed | Model Performance |
| --- | --- | --- | --- |
| MHA | Multiple heads independently generate keys and values with no sharing (full-dimensional computation). | ⭐️ | ⭐️⭐️⭐️ |
| MQA | All query heads share a single key-value pair (single KV group). | ⭐️⭐️⭐️ | ⭐️ |
| GQA | Query heads share key-value pairs in groups (multiple KV groups). | ⭐️⭐️ | ⭐️⭐️ |
| MLA | Key-value pairs are compressed into low-dimensional latent vectors and decoded with decoupled RoPE to retain positional information. | ⭐️⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️ |
MQA/GQA: A "simplified version" of MHA, focusing on efficiency at the cost of information loss.
MLA: An "upgraded compressed version" that balances memory efficiency and information retention, even outperforming MHA.
Architectural Innovation: MLA is not a mere optimization but a rethinking of the attention mechanism, using latent variables to reconstruct it mathematically. It achieves the best of both worlds: efficiency and capability.
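The memory argument behind this comparison can be illustrated with rough per-token arithmetic. The sketch below contrasts the KV-cache footprint of MHA, MQA, GQA, and an MLA-style latent cache; the dimensions (128 heads, 128-dim heads, a 512-dim latent plus a 64-dim decoupled RoPE key) are assumptions for illustration, loosely inspired by published DeepSeek configurations, not measured values.

```python
# Rough per-token, per-layer KV-cache size (in elements) for each attention
# variant. All dimensions below are illustrative assumptions, not exact
# figures from any specific model.
n_heads    = 128   # query heads
head_dim   = 128   # per-head dimension
kv_groups  = 8     # KV groups for GQA (assumed)
latent_dim = 512   # MLA compressed latent dimension (assumed)
rope_dim   = 64    # decoupled RoPE key stored alongside the latent (assumed)

cache_sizes = {
    "MHA": 2 * n_heads * head_dim,    # full keys + values for every head
    "MQA": 2 * 1 * head_dim,          # one shared key/value head
    "GQA": 2 * kv_groups * head_dim,  # one key/value head per group
    "MLA": latent_dim + rope_dim,     # compressed latent + RoPE key
}

for name, elems in cache_sizes.items():
    print(f"{name}: {elems:>6} elements/token/layer "
          f"({elems * 2 / 1024:.1f} KiB at fp16)")
```

Under these assumed dimensions, the latent cache is tens of times smaller per token than a full MHA cache, which is what makes long-sequence decoding far cheaper in memory.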
FlashMLA Performance Evaluation by Novita AI
DeepSeek has announced that FlashMLA reaches up to 3000 GB/s of memory bandwidth and 580 TFLOPS of compute throughput on the H800 SXM5 GPU. To validate these claims, Novita AI conducted a comprehensive evaluation, testing FlashMLA under various parameter configurations.
To present the results more intuitively, the horizontal axis in the performance charts represents the following parameter configurations:
Batch Size
Sequence Length
Number of Attention Heads
Note: These results are based on the official test scripts. Without knowledge of the optimal parameter configurations, the data may not fully reflect the theoretical maximums.
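For readers who want to run similar measurements themselves, the sketch below shows the general pattern such benchmarks follow: time the kernel with CUDA events, then convert the bytes moved and floating-point operations of the chosen configuration into GB/s and TFLOPS. The kernel callable and the byte/FLOP counts are placeholders to be filled in for the configuration under test; this is not the official benchmark script.

```python
import torch

def measure(run_kernel, bytes_moved, flops, iters=100, warmup=10):
    """Time a CUDA kernel and report achieved bandwidth and throughput.

    run_kernel  -- zero-argument callable that launches the kernel under test
                   (placeholder for, e.g., a FlashMLA benchmark invocation)
    bytes_moved -- bytes read + written per call (KV cache, queries, outputs)
    flops       -- floating-point operations per call for the configuration
    """
    for _ in range(warmup):
        run_kernel()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        run_kernel()
    end.record()
    torch.cuda.synchronize()

    ms = start.elapsed_time(end) / iters          # average latency per call
    gb_per_s = bytes_moved / (ms * 1e-3) / 1e9    # achieved memory bandwidth
    tflops = flops / (ms * 1e-3) / 1e12           # achieved compute throughput
    return ms, gb_per_s, tflops
```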
What Impact Will FlashMLA Have?
The release of FlashMLA has not only captured the interest of developers but also drawn positive responses from the mainstream inference frameworks vLLM and SGLang.
vLLM Integration:
The vLLM team has announced plans to integrate FlashMLA soon. Technically, FlashMLA is built on PagedAttention, making it highly compatible with vLLM's technology stack. Once integrated, FlashMLA is expected to further enhance vLLM's inference performance.
SGLang Adoption:
SGLang will continue utilizing the already integrated FlashInferMLA, which has been evaluated to deliver performance comparable to FlashMLA.
Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models through a simple API, while also providing an affordable and reliable GPU cloud for building and scaling.