New model export [call for feedback and help] #322
Conversation
A few more questions on my mind, as exports are very annoying to change in the future:
My plan is to merge this first, and then think through how to merge the int8 quantization version, and maybe the other versions (cuda, etc.).
Probably a lot can be learned from https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md
nbytes += 1
pad = 256 - nbytes # pad the rest with zeros
assert pad >= 0
out_file.write(b'\0' * pad)
If you do this, you don't have to keep track of nbytes, which is error-prone.
Suggested change:
- out_file.write(b'\0' * pad)
+ out_file.write(b'\0' * (-out_file.tell() % 256))
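(To spell out why this works, here is a minimal sketch; the file name, the placeholder header fields, and the 256 figure are only for illustration:)

```python
import struct

def pad_to_alignment(out_file, align=256):
    """Pad with zero bytes so the next write starts on an `align`-byte boundary.

    -out_file.tell() % align is the distance to the next multiple of `align`
    (0 if we are already aligned), so no separate nbytes counter is needed.
    """
    out_file.write(b'\0' * (-out_file.tell() % align))

# usage sketch with a hypothetical output file
with open('model.bin', 'wb') as out_file:
    out_file.write(struct.pack('iii', 1, 2, 3))  # say, 12 bytes of header fields
    pad_to_alignment(out_file)                   # pads out to offset 256
    assert out_file.tell() % 256 == 0
```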
TIL thank you!
… version 0 and version 1
We can load/map Meta and HF models to model.py format and let …

UPDATE: I noticed that the model.py Transformer class has an …
I think I am going to merge this branch to master because the previous functionality around the old model.bin files (now termed "version 0") works without issues. There is a bunch of new code for version 1,2 export that is unused, but it can be tweaked with further PRs, and slowly we can migrate the run files to use them instead (including run.c). Eventually when we're happy with v1,2,... we'll deprecate v0 and delete the legacy code.
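(For illustration only: a rough sketch of how a versioned header could let readers tell v1/v2 files apart while v0 stays the legacy fallback. The magic value, field names, and 256-byte figure here are hypothetical, not the PR's actual layout.)

```python
import struct

def write_versioned_header(out_file, version, config_ints):
    """Hypothetical sketch: magic number + format version + config ints,
    padded so the weights that follow start on a 256-byte boundary."""
    out_file.write(struct.pack('I', 0x616B3432))      # made-up magic value
    out_file.write(struct.pack('i', version))         # export format version (1, 2, ...)
    for v in config_ints:                             # e.g. dim, n_layers, n_heads, ...
        out_file.write(struct.pack('i', v))
    out_file.write(b'\0' * (-out_file.tell() % 256))  # pad header; weights start aligned

# a reader can then dispatch on the version field, and keep the headerless
# v0 format as the legacy path until it is deprecated.
```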
""" writes one int8 tensor to file that is open in wb mode """ | ||
d = tensor.detach().cpu().view(-1).numpy().astype(np.int8) | ||
b = struct.pack(f'{len(d)}b', *d) | ||
file.write(b) |
Copied from #312 (comment)
Pad with zeroes to align on a 64-byte boundary (128 would be better, see #312 (comment)); maybe 256 (the current alignment of the first weights after the header) would be even better. This is a sketch. The same thing should be done in serialize_fp32() above and in the C files.
file.write(b'\0' * (-file.tell() % 64))  # pad to the next 64-byte boundary
file.write(b)
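Putting the quoted snippet and the padding suggestion together, a self-contained sketch (the name serialize_int8 and the 64-byte default are my assumptions based on the comments above, not necessarily what the PR ends up using):

```python
import struct
import numpy as np

def serialize_int8(file, tensor, align=64):
    """Write one int8 torch tensor to a file opened in 'wb' mode, first padding
    with zeros so the tensor data starts on an `align`-byte boundary."""
    file.write(b'\0' * (-file.tell() % align))                  # align before the data
    d = tensor.detach().cpu().view(-1).numpy().astype(np.int8)  # flatten to int8
    b = struct.pack(f'{len(d)}b', *d)                           # pack as signed bytes
    file.write(b)
```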
I'm sorry, I am not well versed in Python/numpy/pytorch
New model export (the code remains "dead" and the legacy version is still the default behavior, so no breaking changes are introduced). The major benefit is a new export.py file, which we can use to centralize work on formatting: both imports and exports.
I took the new grouped int8 export from my int8 PR #312 and made it just a new "version 1". I think we'll be in a world with a few output .bin formats, so I'm centralizing things to a new export.py file. And this will allow us to slowly introduce new formats and deprecate the old ones over time, too.

@kroggen I think you'd want to potentially include your own export here? As it only quantizes on the level of tensors.

We'd also want to move over the Llama export and the HF export into this file, I think, and then delete those files. I have to think this through a little bit more still.
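For anyone skimming, here is a minimal sketch of the difference between grouped int8 quantization (what "version 1" does, per the description above) and per-tensor quantization; the group size of 64 and the function name are illustrative only, not the actual code from #312:

```python
import numpy as np

def quantize_int8_grouped(w, group_size=64):
    """Symmetric int8 quantization with one fp32 scale per group of
    `group_size` values (vs. a single scale for the whole tensor)."""
    w = np.asarray(w, dtype=np.float32).reshape(-1)
    assert w.size % group_size == 0, "tensor size must be a multiple of group_size"
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 127.0  # per-group scale
    scale[scale == 0] = 1.0                                     # guard against all-zero groups
    q = np.round(groups / scale).astype(np.int8)                # quantized weights
    return q.reshape(-1), scale.squeeze(1)                      # int8 values + fp32 scales

# per-tensor quantization would instead use one scale for the whole tensor:
#   scale = np.abs(w).max() / 127.0
```

Grouping limits the damage a single outlier value can do to its own group of values, which is the usual motivation for a grouped scheme over a per-tensor one.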