On KL-Divergence and Context Size Optimization in GGUF Quantization
note: This paper was written with AI, and the exploration it describes was done collaboratively with AI. The "we" used here is us: me and a few models.
In llama.cpp model quantization, keeping a model's performance intact after reducing its precision is a complex but valuable task. GGUF quantization compresses models efficiently by lowering the bit-width of their parameters, much like other methods, but it also offers a way to avoid issues common to fully automatic processes. Because lower-bit representations have less information capacity, quantization introduces errors, so effective calibration (or its GGUF equivalent) is needed to ensure the quantized model performs close to the original.
The GGUF format's calibration-like tool for improving quantization is the importance matrix (or "iMatrix"). This is a structure, derived from statistics of the model's activations as it processes a dataset, that guides the quantization process by highlighting the parts of the model's parameters that matter most. By identifying where quantization tends to cause significant deviations from the original model, the iMatrix allows for focused adjustments that reduce error in those important areas. Essentially, it gives the quantizer "importance-weighted" data, helping to keep the quantized model aligned with the original by preserving accuracy in the most crucial sections.
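To make "importance-weighted" concrete, here is a minimal sketch of the idea in NumPy. It is not llama.cpp's actual quantizer; the brute-force scale search, the 4-bit symmetric scheme, and the shape of the importance vector are all assumptions for illustration. The point is simply that parameters flagged as important dominate the error term, so the chosen quantization scale protects them first.

```python
# Conceptual sketch (not llama.cpp's quantizer): importance weights bias the choice
# of quantization scale toward preserving the weights that matter most.
import numpy as np

def quantize_row_weighted(w, importance, n_bits=4):
    """Pick a symmetric quantization scale for one row of weights by brute-force
    search, minimizing the importance-weighted squared reconstruction error."""
    qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for 4-bit symmetric
    best_scale, best_err = None, np.inf
    # Try a range of candidate scales around the naive max-abs choice.
    for factor in np.linspace(0.7, 1.3, 61):
        scale = factor * np.max(np.abs(w)) / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.sum(importance * (w - q * scale) ** 2)  # weighted, not plain, MSE
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)            # one row of a weight matrix
importance = rng.uniform(0.1, 10.0, size=256)          # e.g. mean squared activations
scale, err = quantize_row_weighted(w, importance)
print(f"chosen scale: {scale:.5f}, weighted error: {err:.4f}")
```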
The relationship between calibration and GGUF’s importance matrix (iMatrix) is that both aim to reduce quantization errors, but they do so differently. Calibration usually adjusts the model's parameters to fix errors caused by lower-bit quantization, often fine-tuning the model after quantization to match the original model more closely. On the other hand, GGUF’s iMatrix approach identifies and highlights important areas during quantization, guiding the process by marking certain data as "important" where errors could have a big impact. Instead of adjusting the model after quantization, the iMatrix helps direct the quantization process from the start, maintaining accuracy in key areas.
The llama.cpp discussions we reviewed focused on strategies for choosing the most effective data to create an iMatrix that best supports quantization. The discussions emphasized identifying data segments where quantization causes the most significant errors, particularly by analyzing KL-divergence. By selecting high-divergence chunks (outlier textual sections) for the iMatrix, the aim was to prioritize these "high-impact" areas. This ensures that the quantization process preserves accuracy in the model's most sensitive regions. It is intended to guide the selection of data to enhance the iMatrix's role in maintaining model fidelity after quantization. They also covered other topics, including the order of the data used to create the iMatrix and the importance of window size used in the process. This paper serves as an analysis of those discussions.
In our analysis, similar to the initial discussions, we use KL-divergence (Kullback–Leibler divergence) to evaluate how quantization impacts model accuracy. KL-divergence measures the difference between the output probabilities of the quantized model and the original (baseline) model. In this context, it helps us see how well the quantized model matches the probabilistic predictions of the unquantized version. Lower KL-divergence values indicate that the quantized model's outputs are closely aligned with the baseline, showing minimal distortion introduced by quantization.
KL-divergence is especially effective for quantization testing: it lets us see the average difference across outputs as well as the distribution of those differences. By examining specific percentiles of KL-divergence values (such as the 90th and 95th percentiles), we can focus on the most substantial deviations or "outlier errors." These high-percentile values reveal how the quantized model performs in the cases where errors are likely to have the largest impact, providing insight into the model's stability on complex or challenging inputs. In this way, KL-divergence guides optimal quantization by pinpointing where the most significant errors occur, which can be used to ensure calibration efforts reduce these critical outlier deviations.
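As a concrete sketch of how these statistics come together (hypothetical code, not llama.cpp's implementation): given per-token logits from the baseline and quantized models, the per-token KL-divergence and the summary metrics used below (median, KLD_90, KLD_95) could be computed roughly as follows.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)    # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_per_token(base_logits, quant_logits, eps=1e-10):
    """KL(P_base || P_quant) at each token position."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

# Toy shapes: 512 token positions over an 8,000-entry vocabulary.
rng = np.random.default_rng(0)
base = rng.normal(size=(512, 8000)).astype(np.float32)
quant = base + rng.normal(scale=0.05, size=base.shape).astype(np.float32)  # small perturbation

kld = kl_per_token(base, quant)
print("median :", np.median(kld))
print("KLD_90 :", np.percentile(kld, 90))   # 90th-percentile divergence
print("KLD_95 :", np.percentile(kld, 95))   # 95th-percentile divergence
```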
We review two llama.cpp conversations in particular, examining their data and discussion notes about context size in GGUF quantization, to iteratively refine our understanding of which factors and methods lead to improved quantization accuracy. Several key questions guided our exploration:
Key Questions in Understanding iMatrix and Quantization Effectiveness
How does the native context size of a model affect KL-divergence in quantization?
What role does dataset selection play in calibrating iMatrix for minimizing divergence?
Is a random sampling approach valid for iMatrix calibration, or does structured, non-random data yield better results?
Are there diminishing returns in subsampling strategies for iMatrix data?
1. Native Context Size and Its Impact on KL-Divergence
We began by analyzing KL-divergence across four calibration context sizes: 512, 2048, 4096, and 8192 tokens. The objective was to determine whether quantization error was minimized at a particular context size that aligned with the model's operational characteristics, referred to here as the native context size. Calibrating at the native context size, we reasoned, would allow the model to perform optimally by minimizing KL-divergence post-quantization. For this particular model, we tested the effects across these context sizes to evaluate which best reduced error.
Identifying Key KL-Divergence Metrics and Interpreting Lower Scores as Better
The KL-divergence metrics of quantization calibration provide insight into how closely a quantized model replicates the unquantized baseline. Lower KL-divergence values are desirable, as they indicate reduced divergence between the two models, meaning the quantized model’s outputs are closer to those of the original, higher-fidelity model.
Among the available KL-divergence values, median, KLD_95, and KLD_90 emerged as critical metrics:
Median KL-Divergence: The median value serves as an indicator of central tendency, revealing the typical level of divergence across chunks. A lower median value suggests overall stability in the quantized model's predictions, aligning closely with the baseline.
KLD_95 and KLD_90 (95th and 90th Percentiles): High-percentile values such as KLD_95 and KLD_90 capture the behavior of the model at outlier levels of divergence. These values are especially relevant, as they highlight areas where quantization errors have the most severe impact. Lower high-percentile values, therefore, imply that the model avoids significant divergence in these critical sections, helping to ensure reliable performance in high-impact scenarios.
Through our analysis, we focused on these three metrics, as lower values across them indicate a quantized model that more accurately reflects the original, minimizing significant divergence and stabilizing general output quality.
Data Analysis: Comparing Context Sizes to Determine the Native Context
With these metrics in mind, we assessed KL-divergence results across the four context sizes to pinpoint the context with the lowest scores, thus identifying the native context. Below are the key values for each context size, with the 4096 context size used as the basis for comparison since it demonstrated the lowest KL-divergence across all metrics:
| Context Size | Median | % Difference from 4096 | KLD_95 | % Difference from 4096 | KLD_90 | % Difference from 4096 |
|---|---|---|---|---|---|---|
| 512 | 0.003271 | +5.28% | 0.135895 | +5.55% | 0.077871 | +4.76% |
| 2048 | 0.003527 | +13.52% | 0.139948 | +8.69% | 0.078575 | +5.70% |
| 4096 | 0.003107 | (Baseline) | 0.128744 | (Baseline) | 0.074335 | (Baseline) |
| 8192 | 0.003311 | +6.57% | 0.144028 | +11.89% | 0.079583 | +7.06% |
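The percentage columns are straightforward to reproduce. The short sketch below recomputes them from the raw values in the table; small last-digit differences come from rounding in the reported figures.

```python
# Recompute the "% difference from 4096" columns from the raw KL-divergence values.
baseline = {"median": 0.003107, "kld_95": 0.128744, "kld_90": 0.074335}   # 4096 row
rows = {
    512:  {"median": 0.003271, "kld_95": 0.135895, "kld_90": 0.077871},
    2048: {"median": 0.003527, "kld_95": 0.139948, "kld_90": 0.078575},
    8192: {"median": 0.003311, "kld_95": 0.144028, "kld_90": 0.079583},
}
for ctx, vals in rows.items():
    diffs = {k: 100.0 * (v - baseline[k]) / baseline[k] for k, v in vals.items()}
    print(ctx, {k: f"+{d:.2f}%" for k, d in diffs.items()})
```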
The data reveals several significant findings:
Median KL-Divergence: The native context size of 4096 tokens produces the lowest median KL-divergence at 0.003107, serving as the baseline for comparison. Shorter contexts like 512 and 2048 show higher median values by approximately 5.28% and 13.52%, respectively, while extending to 8192 tokens results in a 6.57% increase. This suggests that shorter or longer contexts introduce instability, whereas the native context size minimizes divergence at the median level.
KLD_95 (95th Percentile): At the high end, 4096 again performs best with a KLD_95 value of 0.128744. Shorter contexts such as 512 and 2048 have KLD_95 values 5.55% and 8.69% higher, respectively, while extending the context to 8192 results in an 11.89% increase. This pattern indicates that deviations from the native context size increase outlier divergence, underscoring 4096 as the most stable choice for controlling high-end errors.
KLD_90 (90th Percentile): Similarly, the 4096 context size yields the lowest KLD_90 value at 0.074335. At shorter contexts, KLD_90 values increase by 4.76% for 512 and 5.70% for 2048. The longest context, 8192, exceeds the model's context size, so the test overshot, stressing attention in the opposite way from the smallest sizes; it shows a 7.06% increase over the baseline, further reinforcing that the native context size reduces divergence even at high-divergence percentiles.
There is a lot of noise here, but if we take the 4096 row as the baseline and set aside the parts of the data that disagree with the overall trend, we see a gradient of values roughly 1-2% apart. The values are lowest for the 4096-token test, which best aligns with the model's training configuration: using the full native context minimizes divergence, and error rates rise as the context size deviates from 4096.
The native context size of 4096 achieves the lowest KL-divergence values across median and high-percentile metrics, making it the optimal range for calibration in this model. This alignment allows quantization to perform with minimal divergence, ensuring that the quantized model closely approximates the original while avoiding the instability seen with non-native contexts.
2. Dataset Selection for iMatrix Calibration
Next, we looked at the model's performance with real-world application data, emphasizing sections where quantization is most likely to introduce error. Here, we discussed the initial choice of a general-purpose dataset of pseudo-random data ("groups_merged.txt") and the subsequent move to Wikitext for analyzing high-divergence sections.
In comparing diverse datasets, one member of the conversation noted that Wikitext data had higher KL-divergence scores on average than groups_merged.txt, suggesting that Wikitext’s lower entropy may introduce more challenging sections for the model to predict accurately after quantization. This revelation led to targeted chunk selection in the dataset, aiming to include high-divergence segments in the iMatrix to strengthen the model's quantization resilience specifically in these areas.
This selective approach, identifying chunks with high divergence and excluding others from the iMatrix, helped the model perform more consistently across outliers, a valuable insight for future calibration practices. However, we realized that while this approach yielded improvements, its returns were modest, as explained below.
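A minimal sketch of that selection step (hypothetical code; the chunk size, token budget, and scoring are assumptions rather than the exact procedure from the discussion): score each fixed-size chunk by its mean per-token KL-divergence against the baseline, then keep only the highest-divergence chunks for the iMatrix calibration set.

```python
import numpy as np

def select_high_divergence_chunks(token_kld, chunk_size=512, keep_tokens=40_000):
    """Rank fixed-size chunks by mean per-token KL-divergence and keep the top ones.

    token_kld: 1-D array of per-token KL-divergence values (quantized vs. baseline),
               in the natural order of the calibration corpus.
    Returns the starting token indices of the selected chunks.
    """
    n_chunks = len(token_kld) // chunk_size
    scores = token_kld[: n_chunks * chunk_size].reshape(n_chunks, chunk_size).mean(axis=1)
    keep_chunks = max(1, keep_tokens // chunk_size)
    top = np.argsort(scores)[::-1][:keep_chunks]        # highest-divergence chunks first
    return sorted(int(i) * chunk_size for i in top)     # keep natural corpus order

# Toy example: ~500k tokens of per-token divergences, keep the worst ~40k tokens.
rng = np.random.default_rng(0)
token_kld = rng.gamma(shape=0.5, scale=0.01, size=500_000)
starts = select_high_divergence_chunks(token_kld)
print(f"selected {len(starts)} chunks of 512 tokens, e.g. starting at {starts[:5]}")
```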
3. Random vs. Non-Random Sampling in iMatrix Construction
Early in the discussions, the Wikitext KL-divergence test was used as a baseline to measure the divergence between the quantized model and the original model using a structured, continuous text source. Wikitext is a lower-entropy dataset that includes natural language structure, making it a useful benchmark to observe quantization effects across a range of predictable, real-world data. By calculating KL-divergence on Wikitext, the team could identify areas where quantization introduced higher divergence, especially in structured, real-data sequences. This allowed for a more targeted approach in iMatrix construction, focusing on high-divergence sections and high-impact regions for calibration.
However, in an earlier llama.cpp discussion it had been decided that random data was better, although that conclusion was ultimately challenged with success. It was not immediately clear in this later discussion whether the Wikitext data was randomized or not. (This is all elaborated in such detail because of section 4.)
As a side note, activations are generated independently for each chunk of tokens: after a chunk is processed, its logits are not carried over to the next chunk. Here's how it works and why logits do not persist between chunks:
Independent Logit Generation:
For each request (or chunk, as it applies to the iMatrix dataset) in the context, the model processes tokens and generates logits before selecting the next token. This creates a new set of logits, where each logit vector holds a score for every possible next token in the vocabulary. These raw logits go through a softmax function, turning them into a probability distribution over how likely each token is to come next given the supplied context. A token is then chosen from this distribution using a selection strategy: greedy decoding picks the token with the highest probability, while top-k sampling limits the choice to the top-k tokens and adds some randomness. In tasks like response generation, the model keeps generating tokens up to a set maximum, determined by parameters like `max_tokens` or `num_predict`.
Each request generates logits from scratch for the given tokens, without considering logits from previous requests. As the model produces tokens, it moves its context window forward, retaining the last `context_size` tokens in memory, which ensures the context never exceeds the model's maximum context length.
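The sketch below illustrates that independence with a toy stand-in for the model (hypothetical code; `fake_model_logits` is a placeholder for a real forward pass): each chunk is turned into logits and a next-token choice entirely on its own, and nothing computed for one chunk feeds into the next.

```python
import numpy as np

VOCAB_SIZE = 1000
rng = np.random.default_rng(0)

def fake_model_logits(chunk):
    """Stand-in for a forward pass: returns one logit per vocabulary entry."""
    return rng.normal(size=VOCAB_SIZE)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_token(chunk, top_k=None):
    """Greedy pick if top_k is None, otherwise sample from the top-k tokens."""
    probs = softmax(fake_model_logits(chunk))
    if top_k is None:
        return int(np.argmax(probs))
    top = np.argsort(probs)[::-1][:top_k]
    return int(rng.choice(top, p=probs[top] / probs[top].sum()))

tokens = rng.integers(0, VOCAB_SIZE, size=2048)        # a toy calibration stream
context_size = 512
# Each 512-token chunk is processed on its own; no logits carry over between chunks.
for start in range(0, len(tokens), context_size):
    chunk = tokens[start : start + context_size]
    print(f"chunk at {start}: next token {next_token(chunk, top_k=40)}")
```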
We considered whether the Wikitext KL-divergence test might have used random sampling, as random sampling is often used as a broad approach to gauge average model performance across varied data. Random sampling provides a baseline view of quantization effects by delivering a representative mix without focusing on any specific text structure or sequence. If Wikitext had used random sampling, it would have indicated an aim to generalize the model’s quantization stability across diverse data types, rather than targeting specific patterns or high-impact areas.
The choice between random and non-random sampling for iMatrix calibration turns out to matter, and it highlights the value of structured data in addressing the real challenges of quantization. Additional context revealed that the Wikitext test was structured and followed the natural text order, allowing the model to engage with the data's inherent patterns more realistically. This structured, non-random approach gave a clearer view of where quantization introduces errors in actual data sequences, whereas randomized sampling may overlook systematic weaknesses. It appears that lower-entropy, natural-language structure can provide insight into how quantization affects the model on realistic text data, allowing the team to identify high-divergence areas that could be prioritized in the iMatrix.
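To illustrate the distinction (hypothetical code, not the procedure used in the discussions): the two strategies differ only in whether calibration chunks keep their natural corpus order or are shuffled before use.

```python
import random

def build_calibration_chunks(corpus_tokens, chunk_size=512, n_chunks=100, randomize=False):
    """Split a token list into fixed-size chunks, in natural order or shuffled."""
    chunks = [corpus_tokens[i : i + chunk_size]
              for i in range(0, len(corpus_tokens) - chunk_size + 1, chunk_size)]
    if randomize:
        random.shuffle(chunks)           # random sampling: order and locality are lost
    return chunks[:n_chunks]             # structured sampling keeps the text's natural flow

corpus = list(range(500_000))            # stand-in for a tokenized Wikitext corpus
structured = build_calibration_chunks(corpus, randomize=False)
randomized = build_calibration_chunks(corpus, randomize=True)
print(structured[0][:5], randomized[0][:5])
```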
4. Diminishing Returns in Subsampling for iMatrix Calibration
To understand the impact of selective subsampling for iMatrix calibration, we examined the relative improvements across different iMatrix calibration strategies, each aiming to reduce KL-divergence. Specifically, we compared three setups: quantization with no iMatrix (q4_0 without calibration), quantization using top ~40k tokens with high KL-divergence, and full ~500k tokens from Wikitext with iMatrix. By comparing critical metrics—median, KLD_95, and KLD_90—across these setups, we sought to determine if targeted selection yielded meaningful improvements beyond a general-purpose dataset.
Data Comparison and Initial Findings
Each calibration setup had the following KL-divergence statistics for the key metrics (median, KLD_95, and KLD_90):
| Metric | No iMatrix (q4_0) | Top ~40k Tokens, iMatrix | Full ~500k Tokens, iMatrix | Relative Improvement (Top ~40k vs. Full ~500k) |
|---|---|---|---|---|
| Median | 0.016759 | 0.013641 | 0.013710 | 0.51% better with Top ~40k |
| KLD_95 | 0.169855 | 0.142108 | 0.142599 | 0.34% better with Top ~40k |
| KLD_90 | 0.104060 | 0.086210 | 0.086496 | 0.27% better with Top ~40k |
These findings confirm that the presence of an iMatrix—whether targeted or generalized—yields notable improvements over the no-iMatrix baseline. However, the incremental gains when using the top ~40k high KL-divergence tokens over the full ~500k tokens were small.
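To put "notable improvements" in numbers, the quick recomputation below, using the values from the table above, compares each iMatrix setup against the no-iMatrix q4_0 baseline; the gain from having an iMatrix at all is far larger than the gain from choosing which tokens go into it.

```python
# Improvement of each iMatrix setup over the no-iMatrix q4_0 baseline (values from the table).
no_imatrix = {"median": 0.016759, "kld_95": 0.169855, "kld_90": 0.104060}
top_40k    = {"median": 0.013641, "kld_95": 0.142108, "kld_90": 0.086210}
full_500k  = {"median": 0.013710, "kld_95": 0.142599, "kld_90": 0.086496}

for name, setup in [("top ~40k", top_40k), ("full ~500k", full_500k)]:
    gains = {k: 100.0 * (no_imatrix[k] - setup[k]) / no_imatrix[k] for k in no_imatrix}
    print(name, {k: f"-{g:.1f}%" for k, g in gains.items()})
# Both setups cut median divergence by roughly 18% and the high-percentile metrics by
# roughly 16-17%, while the two setups differ from each other by only a few tenths of a percent.
```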
Analyzing Relative Improvements to Assess Diminishing Returns
To assess whether these improvements were meaningful or indicative of diminishing returns, we examined the relative differences between the targeted top ~40k token iMatrix and the general-purpose full ~500k token iMatrix:
Median KL-Divergence: The top ~40k tokens showed a median score of 0.013641, compared to 0.013710 for the full ~500k tokens. This yields a difference of 0.51% in favor of the top ~40k tokens, a marginal improvement that might fall within the margin of error.
KLD_95: The top ~40k tokens produced a KLD_95 score of 0.142108 versus 0.142599 for the full ~500k tokens. This results in a 0.34% improvement with the targeted selection, another small gain indicating that subsampling has minimal additional effect.
KLD_90: The top ~40k tokens gave a KLD_90 score of 0.086210, compared to 0.086496 with the full ~500k tokens. This yields a 0.27% improvement, again a minimal difference.
These small percentages indicate that the additional improvement gained by focusing on high-divergence chunks is slight, especially compared to the more substantial improvement achieved by introducing iMatrix calibration at all. The 0.3%–0.5% incremental gain from subsampling is not automatically dismissible, but it is half an order of magnitude smaller than the differences we were seeing between context sizes. These small gains suggest that further refining subsampling strategies would likely yield diminishing returns, and that the model's calibration could be better enhanced by focusing on broader, high-quality datasets.
Toward Comprehensive Data
Based on this analysis, it’s clear that the iMatrix calibration contributes to an improvement, while the additional approach of selecting high KL-divergent chunks contributes only modestly. This diminishing return indicates that beyond a certain point, selective subsampling adds minimal value. Instead, increasing the quantity and quality of accurately curated, diverse data may offer more impactful gains by capturing a wider array of patterns and outliers within realistic data distributions. In summary, while targeted iMatrix selection based on high-divergence chunks offers gains, its impact is limited. Expanding to a diverse, representative dataset appears to be the more effective approach for maintaining alignment with the unquantized baseline, especially for models expected to handle varied real-world input.