Unlocking Efficient LLM Deployment with Activation-Aware Weight Quantization (AWQ)

Lokesh kummari

Large Language Models (LLMs) are revolutionizing AI, powering applications from conversational agents to automated content creation. However, the massive size and computational demands of these models pose a major hurdle when it comes to deploying them on resource-constrained devices—imagine trying to run a state-of-the-art LLM on your phone!

This post dives into Activation-aware Weight Quantization (AWQ), a powerful technique that smartly reduces the precision of LLM weights, leading to significant gains in speed and efficiency without a drastic drop in accuracy. We'll walk through why quantization is critical for LLMs, how AWQ works, and the benefits of this targeted approach.

Why Quantize LLMs?

The Need for Quantization

LLMs typically use 16-bit or even 32-bit floating-point numbers to represent their weights. While these high-precision formats ensure accuracy, they also demand substantial storage and computation:

  • Memory Footprint: High-precision weights translate into large model sizes, making deployment on edge devices challenging.

  • Computational Overhead: Running arithmetic operations on FP16 or FP32 numbers requires more power and processing resources.

Simply rounding or truncating these weights to a lower precision (e.g., 8-bit integers) can lead to a significant loss in model accuracy. This is where a more nuanced approach—Activation-aware Weight Quantization (AWQ)—comes into play.


How Does AWQ Work?

Unlike traditional methods that uniformly quantize all weights, AWQ takes an activation-aware approach. It leverages the insight that not all weights contribute equally to the model’s performance.

Activation Analysis: Unmasking the Important Weights

The core idea behind AWQ is that a small subset of weights (the salient, or "VIP," weights) has a disproportionate impact on the model's output. AWQ performs activation analysis as follows; a minimal code sketch of the idea appears after the list:

  • Representative Data:
    A small, representative dataset (or even a subset of your training data) is fed into the model. The goal is to trigger typical activations that the model would see during inference.

  • Activation Recording:
    As data flows through the network, AWQ records the activations at each layer. These activations serve as a signal, indicating which neurons are firing strongly and frequently.

  • Importance Estimation:
    By analyzing the distribution and magnitude of these activations, AWQ identifies the weights associated with high-activation neurons. These weights are considered more critical to the model’s performance, while weights linked to lower or infrequent activations are deemed less important.
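
To make this concrete, here is a minimal, illustrative sketch of how per-channel importance could be estimated from activation magnitudes using forward hooks. This is not AutoAWQ's actual implementation; model, calibration_batches, and the importance dictionary are placeholders for the example.

import torch
import torch.nn as nn

importance = {}  # placeholder store: layer name -> accumulated |activation| score per input channel

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach()              # activations entering this linear layer
        x = x.reshape(-1, x.shape[-1])      # flatten batch and sequence dimensions
        score = x.abs().mean(dim=0)         # average magnitude per input channel
        importance[name] = importance.get(name, 0) + score
    return hook

# Attach a hook to every linear layer, then run a few calibration batches.
# `model` is the loaded LLM and `calibration_batches` a small representative dataset (placeholders).
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

with torch.no_grad():
    for batch in calibration_batches:
        model(**batch)

for h in handles:
    h.remove()

# Input channels with the largest scores are treated as the salient ("VIP") ones.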

Selective Quantization: Targeted Precision for Optimal Performance

Once the important weights have been identified, AWQ applies selective quantization:

  • Less Important Weights:
    Weights connected to neurons with lower activations are quantized to lower precision (typically 4-bit or 8-bit integers).

  • Important Weights:
    Weights that are crucial to the model's output are protected. Rather than storing them in a separate, higher-precision format (which is awkward for hardware), AWQ scales these salient channels up before quantization and folds the inverse scale into the activations, so they lose less information when rounded (see the sketch below). This targeted approach ensures that the most significant parameters retain the accuracy needed to preserve the model's performance.
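
As a rough illustration of that scaling trick, here is a simplified sketch. It is not AutoAWQ's internals; W, act_scale, and the helper below are invented for the example, and the real method works group-wise and searches the exponent alpha per layer.

import torch

def quantize_symmetric(w, bits=4):
    # Plain symmetric round-to-nearest quantization, then immediate dequantization.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

def awq_style_protect(W, act_scale, alpha=0.5, bits=4):
    # W: [out_features, in_features]; act_scale: mean |activation| per input channel.
    s = act_scale.clamp(min=1e-5) ** alpha      # bigger scale for high-activation channels
    W_scaled = W * s                            # scale salient input channels up ...
    W_hat = quantize_symmetric(W_scaled, bits)  # ... so they lose less when rounded
    return W_hat / s                            # undo the scaling (in practice folded into the activations)

In the real method the division by s is not applied to the weights at inference time; the inverse scale is fused into the activations or the preceding operation, which is mathematically equivalent but keeps the kernels simple.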

The Quantization Process Simplified

At its core, AWQ involves three main steps:

  1. Scaling:
    The weights are first scaled to a specific range that is more suitable for lower precision representation. This scaling ensures that the quantization process operates within a well-behaved numeric range.

  2. Rounding:
    Once scaled, the weights are rounded to the nearest quantized value. For example, if you're converting to 8-bit integers, each weight is rounded to the nearest integer within the representable range.

  3. Dequantization:
    During inference, the quantized integers are mapped back to approximate real values by multiplying them by their scale (and applying the zero-point, if one is used). In practice this happens on the fly inside optimized kernels, so the matrix multiplications with the activations can still run in higher-precision arithmetic while the weights are stored in just a few bits.

The Math Behind It
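
In simplified, symmetric form (a sketch of the generic scheme; the notation below is mine, not taken verbatim from the AWQ paper), a group of weights \(w\) is quantized to \(b\) bits as:

$$ s = \frac{\max(|w|)}{2^{b-1} - 1}, \qquad q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{w}{s}\right),\, -2^{b-1},\, 2^{b-1} - 1\right), \qquad \hat{w} = s \cdot q $$

AWQ's activation-aware twist is to choose a per-input-channel scale \(\alpha\) from the calibration activations and quantize the scaled weights instead, folding the inverse scale into the activations (or the preceding layer):

$$ y = Wx \;\approx\; Q\big(W \cdot \mathrm{diag}(\alpha)\big)\,\big(\mathrm{diag}(\alpha)^{-1} x\big) $$

Because the salient channels get a larger \(\alpha\), they occupy more of the available quantization range and therefore lose less information when rounded.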

Quantizing Large Language Models with AWQ

With the intuition covered, let's walk through quantizing the Mistral-7B Instruct model with AWQ, from installation to a quantized checkpoint you can run.

1️⃣ Install Dependencies

First, make sure the required libraries are installed. The code below uses the AutoAWQ library, which provides the awq module and can be installed from PyPI.

pip install torch transformers accelerate
pip install autoawq

2️⃣ Load the Pretrained Model

We begin by loading the Mistral-7B Instruct model from Hugging Face.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from awq import AutoAWQForCausalLM

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'

# Load the original weights in FP16 so the later size/latency comparison
# really is FP16 vs. 4-bit.
raw_model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

3️⃣ Define Quantization Configurations

AWQ uses selective quantization to preserve high-importance weights while reducing others to low-bit integers (e.g., 4-bit). Below is a standard AWQ quantization config:

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

  • zero_point: Whether to use asymmetric (zero-point) quantization; True is the usual choice for 4-bit AWQ.

  • q_group_size: Number of weights that share a single scale (and zero-point); 128 is the common default.

  • w_bit: Bit-width of the quantized weights (4-bit here).

  • version: Which inference kernel the weights are packed for (GEMM is a good general-purpose choice).

4️⃣ Apply AWQ Quantization

Now, we quantize the model using the AWQ method.

q_model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=False, use_cache=False
)
q_model.quantize(tokenizer, quant_config=quant_config)

Under the hood, AutoAWQ runs a small calibration dataset through the model (the activation analysis described earlier), derives the per-channel scales, and packs the weights into 4-bit form. The result is a much smaller model that closely tracks the original's accuracy; the step needs enough memory to hold the FP16 model and can take a while for a 7B model.

5️⃣ Save the Quantized Model

After quantization, we can save the model for later use.

quant_path = 'mistral-instruct-v0.2-awq'

q_model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

6️⃣ Compare Model Sizes

Let’s compare the number of parameters before and after quantization.

total_params_raw = sum(p.numel() for p in raw_model.parameters())
total_params_q = sum(p.numel() for p in q_model.parameters())

print("Total Parameters (FP16 Model):", total_params_raw)
print("Total Parameters (Quantized Model):", total_params_q)

Output:

Total Parameters (FP16 Model): 7,249,414,144
Total Parameters (Quantized Model): 7,249,414,144
  • AWQ does not reduce the number of parameters; it only reduces the bit precision of the weights.

  • The storage size decreases significantly (from FP16 to INT4, roughly a 75% reduction in weight memory), but the number of parameters remains unchanged. The quick estimate below shows where that figure comes from.
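
A quick back-of-envelope estimate (my own approximation; it ignores the scales, zero-points, and any layers left unquantized, such as embeddings) shows where the ~75% figure comes from:

params = 7_249_414_144                      # parameter count printed above

fp16_gb = params * 2 / 1024**3              # FP16: 2 bytes per weight
int4_gb = params * 0.5 / 1024**3            # INT4: 4 bits = 0.5 bytes per weight

print(f"FP16 storage: ~{fp16_gb:.1f} GB")   # ~13.5 GB
print(f"INT4 storage: ~{int4_gb:.1f} GB")   # ~3.4 GB
print(f"Reduction:    ~{100 * (1 - int4_gb / fp16_gb):.0f}%")  # 75%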

Evaluating the Performance of Quantized Models

Once we have successfully quantized the Mistral-7B model using AWQ, the next step is to evaluate how well it performs compared to the original floating-point model. We will run inference on both versions and compare their outputs, latency, and memory usage.


7️⃣ Running Inference on the Quantized Model

To test the quantized model, we generate text from a simple prompt:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

quantized_model = AutoAWQForCausalLM.from_quantized(quant_path, device=device)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

prompt = "Explain the significance of quantum computing in AI."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_tokens = quantized_model.generate(**inputs, max_new_tokens=100)

generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print("Generated Text:\n", generated_text)

8️⃣ Comparing Speed and Memory Usage

To understand the benefits of quantization, we compare the latency and memory footprint of the original model vs. the quantized version.

import time

device = "cuda" if torch.cuda.is_available() else "cpu"
raw_model.to(device)
quantized_model.to(device)

def benchmark_model(model, tokenizer, prompt, num_runs=3):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    start_time = time.time()
    for _ in range(num_runs):
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=50)
    end_time = time.time()

    avg_time = (end_time - start_time) / num_runs
    return avg_time

prompt = "How does deep learning work?"
num_runs = 5

raw_latency = benchmark_model(raw_model, tokenizer, prompt, num_runs)
quantized_latency = benchmark_model(quantized_model, tokenizer, prompt, num_runs)

print(f"Latency (FP16 model): {raw_latency:.3f} sec")
print(f"Latency (Quantized model): {quantized_latency:.3f} sec")

Output:

1️⃣ Text Generation Output

After running the quantized Mistral-7B model on the input prompt:

Prompt: Explain the significance of quantum computing in AI.
Generated Text: 
Quantum computing holds the potential to revolutionize artificial intelligence by enabling exponentially faster computations for certain tasks. 
Unlike classical computers that process bits as either 0 or 1, quantum computers leverage qubits, allowing for parallel computations. 
This can significantly enhance optimization problems, cryptography, and AI model training, particularly in areas requiring complex simulations.

2️⃣ Speed & Latency Comparison

Model Version                  Latency (seconds)   Model Size (GB)
FP16 Model                     2.30 sec            ~13.4 GB
AWQ Quantized Model (4-bit)    1.12 sec            ~3.5 GB
  • Roughly 2× faster inference in this run (latency cut by more than half).

  • ~75% Reduction in model size (13.4GB → 3.5GB).

  • Maintains similar text generation quality with minor accuracy trade-offs.

  • Enables deployment on lower-end GPUs (e.g., RTX 3060 12GB).

  • Reduces memory footprint, making LLMs feasible for edge devices.

  • Preserves crucial model weights using activation-aware quantization.
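
To check the size reduction on your own machine rather than trusting the table, one simple option is to sum the bytes actually held by each model's parameters and buffers. This is a minimal sketch that reuses raw_model and quantized_model from the earlier steps; it measures weight storage only, not peak runtime memory with the KV cache.

def weight_footprint_gb(model):
    # Bytes occupied by all parameters and buffers (packed 4-bit tensors count at their packed size).
    total = sum(p.numel() * p.element_size() for p in model.parameters())
    total += sum(b.numel() * b.element_size() for b in model.buffers())
    return total / 1024**3

print(f"Weight footprint (FP16 model):      {weight_footprint_gb(raw_model):.2f} GB")
print(f"Weight footprint (Quantized model): {weight_footprint_gb(quantized_model):.2f} GB")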

Benefits of AWQ

Improved Inference Speed and Efficiency

  • Smaller Model Size:
    By reducing the precision of non-critical weights, AWQ can shrink the overall model size dramatically, making LLMs more accessible for deployment on edge devices.

  • Faster Computation:
    Lower-bit arithmetic operations are inherently faster than their high-precision counterparts. This can lead to significant improvements in inference latency.

Maintaining Accuracy

  • Targeted Precision:
    By preserving the precision of the most influential weights, AWQ minimizes the impact on overall model accuracy. This means you get the best of both worlds: a smaller, faster model that still performs well on real-world tasks.

Enhanced Resource Efficiency

  • Lower Power Consumption:
    With reduced memory bandwidth requirements and lower computational overhead, AWQ enables energy-efficient operation—critical for mobile and embedded applications.

Real-World Implications and Future Directions

The activation-aware approach of AWQ is a game-changer for deploying LLMs in environments where resources are limited. It opens the door for innovative applications on smartphones, IoT devices, and other edge platforms that were previously impractical due to hardware constraints.

As research continues, we can expect further refinements in quantization techniques, including adaptive mixed-precision methods that fine-tune quantization levels even more dynamically. This evolution will further bridge the gap between high-performing LLMs and real-world deployment scenarios.


Conclusion

Activation-aware Weight Quantization (AWQ) represents a significant advancement in the quest to make Large Language Models more efficient and deployable across a variety of platforms. By intelligently analyzing activations and applying selective precision, AWQ achieves an impressive balance between speed, efficiency, and accuracy. This makes it an essential technique for anyone looking to harness the power of LLMs in resource-constrained environments.

Unlock the potential of your LLMs and push the boundaries of AI deployment—experiment with AWQ and see how it can transform your applications!
