Add unit test coverage for llama_tensor_get_type #20112
bartowski1182 wants to merge 18 commits into ggml-org:master
Conversation
Regarding the +23,000 lines and 700 kB from the .schema files, I wouldn't be opposed to instead suggesting maintainers run the script on mainline and then again on their changes, so the files don't have to exist in the actual repo itself; that would certainly declutter.
```cpp
#include "../src/llama-arch.h"
#include "../src/llama-model.h"
#include "../src/llama-quant.h"
```
I know you mentioned to @ddh0 about avoiding these kinds of includes by duplicating the structs.
Figured I should check: is my case a special one because it's being done in tests?
If not, I'm willing to back-burner this and look into the changes to the API you wanted first.
Tests are OK to include internal files. It's problematic for tools and examples because 3rd parties that use libllama will not be able to do that (include internal files) - they only work with the public API.
How long does it currently take to generate the snapshots?
16 seconds if the models aren't cached, 3 seconds if they are.
bartowski1182 left a comment
Adding comments to justify changes to llama-quant.cpp
```cpp
// result of parsing --tensor-type option
// (changes to this struct must be reflected in tools/quantize/quantize.cpp)
struct tensor_type_option {
```
Moved to llama-quant.h
```diff
@@ -1,25 +1,18 @@
 #include "llama.h"
```
Moved to llama-quant.h
```cpp
#include <algorithm>
#include <cmath>
#include <cstring>
#include <string>
```
Moved to llama-quant.h
```cpp
// quantization state
//
struct quantize_state_impl {
```
Moved to llama-quant.h
```diff
-static bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
+bool tensor_allows_quantization(const llama_model_quantize_params * params, llm_arch arch, const ggml_tensor * tensor) {
```
Made public so I can use it in my tests
```diff
-static ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
+ggml_type llama_ftype_get_default_type(llama_ftype ftype) {
```
Made public so it can be tested
```cpp
case LLAMA_FTYPE_MOSTLY_IQ3_S:
case LLAMA_FTYPE_MOSTLY_IQ3_M: return GGML_TYPE_IQ3_S;

default: throw std::runtime_error(format("invalid output file type %d\n", ftype));
```
This was causing annoyance when trying to iterate through all FTYPEs one by one, so I removed the throw from here and added it back in place below.
I could instead iterate in a way that avoids the missing middle quants (Q4_0_4_4 etc.) and revert this.
src/llama-quant.cpp
Outdated
```cpp
    return nullptr;
}

void init_quantize_state_counters(quantize_state_impl & qs, const std::vector<std::string> & tensor_names) {
```
Extracted the initialization of state counters so I can use them outside the main function.
```diff
-    default_type = llama_ftype_get_default_type(ftype);
+    ggml_type default_type = llama_ftype_get_default_type(ftype);
+    if (default_type == GGML_TYPE_COUNT) {
```
(this is where I added the throw back)
```cpp
// compute tensor metadata once and cache it
std::vector<tensor_metadata> metadata(tensors.size());
for (size_t i = 0; i < tensors.size(); ++i) {
    metadata[i].name = ggml_get_name(tensors[i]->tensor);
```
```diff
-    metadata[i].name = ggml_get_name(tensors[i]->tensor);
+    metadata[i].name = tensors[i]->tensor->name;
```
Nitpick, not sure if it matters, but calling ggml_get_name is generally unnecessary where we can just do tensor->name.
Interesting, this raises the question of why `ggml_get_name` even exists if it just returns `tensor->name` ... but yeah, I can make that change.
```diff
@@ -1 +1,99 @@
 #pragma once
```
@ggerganov probably worth grabbing your opinion here.
I'm adding a LOT to this header file, including `<regex>`, `<string>`, `<vector>`, and llama-arch.h.
Is this acceptable, or should I avoid making this a C++ header? The alternative of course would be to duplicate quantize_state_impl in my test code and change the header to only expose the structs/functions that are needed.
Including regex here is unfortunate. I don't think you need to expose that, though; make that part opaque and hidden inside llama-quant.cpp.
Okay, I think a7132f6 should accomplish what you suggested: no more regex in the header, using a unique pointer to a new struct compiled_tensor_type_patterns so that it can live in llama-quant.cpp exclusively.
I relied on Claude's assistance for this, but it looks like it follows similar practices from other sections of the code that also use std::unique_ptr, and both compiling and the unit tests still work.
This is part of a larger goal of reworking or replacing the `llama_tensor_get_type` function. Before major work starts in that area, I want to capture the current existing behaviour thoroughly, so that any accidental changes are easy to spot, and any purposeful changes are easy to document.

To that end, this PR introduces unit test coverage for the function itself.

Using a pre-set list of models, and taking advantage of the new `gguf-model-data` utility, these tests pull real model metadata directly from huggingface, create mock models/tensors, run them through the `llama_tensor_get_type` function, and document the schema into the `tests/snapshots/` directory as `model-name.schema`.

I hope that the storage method I've decided to use won't be overly burdensome. I iterated a lot through various ways to store the existing tensor layouts, and I think this landed in an acceptable compromise area: not too many files, "only" 700 kB of storage. If this is not acceptable, I can go back to the drawing board.

To capture current layouts of tensors, the script can be run with the `--generate` flag, which will download the metadata and produce the schema files:

`--generate output`

Then you can run the test itself without any arguments:

`test output`

There are a few changes to llama-quant.cpp, notably making the function itself non-static so I can access it, extracting `init_quantize_state_counters` and `llama_ftype_default_type` so I can use them without recreating them, and lastly I pulled in the change from @ddh0 in their PR #19770 that adds `tensor_allows_quantization` (except I made it non-static) since I needed a function like that, so once that PR is merged (and this is rebased on latest) that change will go away.

I also moved a couple of things like `quantize_state_impl` to the header file for similar reasons.

Finally, I found an issue with the `gguf-model-data` `gguf_read_uint32_val` when the per-layer head counts are an array, as with Step-3.5-Flash, so I've added a fix for that and a unit test to capture that behaviour.