Add unit test coverage for llama_tensor_get_type #20112
bartowski1182 wants to merge 18 commits into ggml-org:master
Conversation
Regarding the +23,000 lines and 700 kB from the .schema files, I wouldn't be opposed to instead suggesting maintainers run the script on mainline and then again on their changes, so the files don't have to exist in the actual repo itself; that would certainly declutter.
```cpp
#include "../src/llama-arch.h"
#include "../src/llama-model.h"
#include "../src/llama-quant.h"
```
I know you mentioned to @ddh0 about avoiding these kinds of includes by duplicating the structs.
Figured I should check: is my case a special one because it's being done in tests?
If not, I'm willing to back-burner this and look into the changes to the API you wanted first.
Tests are OK to include internal files. It's problematic for tools and examples because 3rd parties that use libllama will not be able to do that (include internal files) - they only work with the public API.
How long does it currently take to generate the snapshots?
16 seconds if the models aren't cached, 3 seconds if they are.
bartowski1182 left a comment
Adding comments to justify changes to llama-quant.cpp
```cpp
// result of parsing --tensor-type option
// (changes to this struct must be reflected in tools/quantize/quantize.cpp)
struct tensor_type_option {
```
Moved to llama-quant.h
```diff
@@ -1,25 +1,18 @@
 #include "llama.h"
```
Moved to llama-quant.h
```cpp
#include <algorithm>
#include <cmath>
#include <cstring>
#include <string>
```
Moved to llama-quant.h
```cpp
// quantization state
//
struct quantize_state_impl {
```
Moved to llama-quant.h
```diff
-static bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
+bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
```
Made public so I can use it in my tests
```diff
-static ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
+ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
```
Made public so it can be tested
```cpp
case LLAMA_FTYPE_MOSTLY_IQ3_S:
case LLAMA_FTYPE_MOSTLY_IQ3_M: return GGML_TYPE_IQ3_S;

default: throw std::runtime_error(format("invalid output file type %d\n", ftype));
```
This was causing annoyance when trying to iterate through all FTYPEs one by one, so I removed the throw from here and added it back in place below.
I could instead iterate in a way that avoids the missing middle quants (Q4_0_4_4 etc.) and revert this.
src/llama-quant.cpp
Outdated
```cpp
    return nullptr;
}

void init_quantize_state_counters(quantize_state_impl & qs, const std::vector<std::string> & tensor_names) {
```
Extracted the initialization of state counters so I can use them outside the main function.
```diff
-    default_type = llama_ftype_get_default_type(ftype);
+    ggml_type default_type = llama_ftype_get_default_type(ftype);
+    if (default_type == GGML_TYPE_COUNT) {
```
(this is where I added the throw back)
```cpp
// compute tensor metadata once and cache it
std::vector<tensor_metadata> metadata(tensors.size());
for (size_t i = 0; i < tensors.size(); ++i) {
    metadata[i].name = ggml_get_name(tensors[i]->tensor);
```
```diff
-    metadata[i].name = ggml_get_name(tensors[i]->tensor);
+    metadata[i].name = tensors[i]->tensor->name;
```
Nitpick, not sure if it matters, but calling ggml_get_name is generally unnecessary where we can just do tensor->name.
Interesting, this raises the question of why `ggml_get_name` even exists if it just returns `tensor->name` ... but yeah, I can make that change.
```diff
@@ -1 +1,99 @@
 #pragma once
```
@ggerganov probably worth grabbing your opinion here.
I'm adding a LOT to this header file, including `<regex>`, `<string>`, `<vector>`, and llama-arch.h.
Is this acceptable, or should I avoid making this a C++ header? The alternative of course would be to duplicate quantize_state_impl in my test code and change the header to only expose the structs/functions that are needed.
Including regex here is unfortunate. I don't think you need to expose that, though; make that part opaque and hidden inside llama-quant.cpp.
Okay, I think a7132f6 should accomplish what you suggested: no more regex in the header, using a unique pointer to a new struct compiled_tensor_type_patterns so that it can live in llama-quant.cpp exclusively.
I relied on Claude's assistance for this, but it looks like it follows similar practices from other sections of the code that also use std::unique_ptr, and both compiling and the unit tests still work.
This is part of a larger goal of reworking or replacing the `llama_tensor_get_type` function. Before major work starts in that area, I want to capture the current existing behaviour thoroughly, so that any accidental changes are easy to spot, and any purposeful changes are easy to document.

To that end, this PR introduces unit test coverage for the function itself.

Using a pre-set list of models, and taking advantage of the new `gguf-model-data` utility, these tests pull real model metadata directly from huggingface, create mock models/tensors, run them through the `llama_tensor_get_type` function, and document the schema into the `tests/snapshots/` directory as `model-name.schema`.

I hope that the storage method I've decided to use won't be overly burdensome. I iterated a lot through various ways to store the existing tensor layouts, and I think this landed in an acceptable compromise area: not too many files, "only" 700 kB of storage. If this is not acceptable, I can go back to the drawing board.

To capture current layouts of tensors, the script can be run with the `--generate` flag, which will download the metadata and produce the schema files:

`--generate output`

Then you can run the test itself without any arguments:

`test output`

There are a few changes to llama-quant.cpp, notably making the function itself non-static so I can access it, extracting `init_quantize_state_counters` and `llama_ftype_default_type` so I can use them without recreating them, and lastly I pulled in the change from @ddh0 in their PR #19770 that adds `tensor_allows_quantization` (except I made it non-static) since I needed a function like that, so once that PR is merged (and this is rebased on latest) that change will go away.

I also moved a couple of things like `quantize_state_impl` to the header file for similar reasons.

Finally, I found an issue with the `gguf-model-data` `gguf_read_uint32_val` when the per-layer head counts are an array, as with Step-3.5-Flash, so I've added a fix for that and a unit test to capture that behaviour.