
New model export [call for feedback and help] #322

Merged: 6 commits from feature/export into master on Aug 20, 2023
Conversation

karpathy (Owner)

I took the new grouped int8 export from my int8 PR #312 and made it just a new "version 1". I think we'll be in a world with a few output .bin formats, so I'm centralizing things to a new export.py file. And this will allow us to slowly introduce new formats and deprecate the old ones over time, too.

@kroggen I think you might want to include your own export here, since it only quantizes at the level of tensors?

We'd also want to move over the Llama export and the HF export into this file, I think, and then delete those files. I have to think this through a little bit more still.
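As a concrete illustration of the centralization idea, one way export.py could be organized is a single entry point that dispatches on a format version. This is only a sketch under assumed names (model_export, legacy_export, version1_export), not necessarily how the file is actually laid out:

    def legacy_export(model, filepath):
        ...  # writer for the original model.bin layout ("version 0")

    def version1_export(model, filepath):
        ...  # writer for a newer format, e.g. the grouped int8 export from #312

    def model_export(model, filepath, version=0):
        """ hypothetical single entry point: pick the writer for the requested .bin version """
        writers = {0: legacy_export, 1: version1_export}
        if version not in writers:
            raise ValueError(f"unsupported export version: {version}")
        writers[version](model, filepath)

With a dispatcher like this, new formats can be added and old ones deprecated by editing the table rather than scattering writers across files.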

@karpathy (Owner, Author)

A few more questions on my mind, as exports are very annoying to change in the future:

  • Think through the header design (a rough sketch of one possibility follows below)
  • Think through cache alignment: is version 1 well-aligned?
  • Tokenizer inclusion? I noticed that in llama.cpp the tokenizer is included in the .bin file. This is nice because the .bin file becomes the only self-contained artifact you need. But it's a bit more annoying to pipe the tokenizer through, it bloats the files a little, and aesthetically I feel like the "integer inference engine" is what a Llama model fundamentally is; the Tokenizer is a whole separate artifact that translates strings <-> ints. So I like that separation.

My plan is to merge this first, and then think through how to merge the int8 quantization version, and maybe the other versions (cuda, etc.).
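On the header and alignment questions, here is a minimal sketch of what a versioned header padded to 256 bytes could look like. The magic value, field order, and the config fields used are assumptions for illustration (config is assumed to carry dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, max_seq_len), not the actual version 1 layout:

    import struct

    def write_header(out_file, config, version=1):
        # hypothetical header writer: pad so the weights that follow start on a 256-byte boundary
        out_file.write(struct.pack('I', 0x616b3432))   # magic number (illustrative value)
        out_file.write(struct.pack('i', version))      # format version
        out_file.write(struct.pack('7i', config.dim, config.hidden_dim, config.n_layers,
                                   config.n_heads, config.n_kv_heads, config.vocab_size,
                                   config.max_seq_len))
        out_file.write(b'\0' * (-out_file.tell() % 256))  # zero-pad up to the next multiple of 256

A fixed-size header with a magic number and version lets readers reject files they don't understand, and the padding puts the first weights at a cache-friendly, mmap-friendly offset.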

@ozabluda (Contributor)

Probably a lot can be learned from

https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md
ggerganov/ggml#302

nbytes += 1
pad = 256 - nbytes # pad the rest with zeros
assert pad >= 0
out_file.write(b'\0' * pad)
A Contributor commented on the lines above:

If you do this, you don't have to keep track of nbytes, which is error-prone.

Suggested change
out_file.write(b'\0' * pad)
out_file.write(b'\0' * (-out_file.tell() % 256) )
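For anyone puzzling over the one-liner: -out_file.tell() % 256 is exactly the number of zero bytes needed to reach the next multiple of 256 (and 0 if the position is already aligned), so the pad amount comes straight from the file position instead of a separately tracked byte count. A quick self-contained check of the arithmetic (illustrative only, not code from the PR):

    import io

    buf = io.BytesIO()
    buf.write(b'x' * 300)                     # pretend 300 bytes of header have been written
    buf.write(b'\0' * (-buf.tell() % 256))    # pad up to the next 256-byte boundary
    assert buf.tell() == 512                  # 300 rounds up to 512
    assert -512 % 256 == 0                    # an already-aligned position needs no padding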

karpathy (Owner, Author) replied:

TIL thank you!

@atamurad (Contributor)

atamurad commented Aug 19, 2023

> We'd also want to move over the Llama export and the HF export into this file, I think, and then delete those files. I have to think this through a little bit more still.

We can load/map Meta and HF models into the model.py format and let model_export() handle the quantization and serialization part. But before that, it would be good to have optional (non-shared) final classifier weights in model.py. I can submit a PR if you're interested (1. separate wcls in model.py, 2. export wcls, 3. update the old export scripts).

UPDATE: I noticed that the model.py Transformer class has an output attribute, so no changes required in model.py.

@karpathy (Owner, Author)

@atamurad yes, exactly. I think we should load the HF/Meta models into Transformer and then export as usual. I haven't yet looked at how this could be achieved. Maybe the export.py file could also gain some import utilities like this.

@ozabluda great links!
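A rough sketch of what such an import utility might look like, assuming the Hugging Face LlamaConfig attribute names and that model.py's ModelArgs/Transformer accept roughly these fields; the key renaming and permutation handling are only described in comments, so treat this as the shape of the idea rather than working import code:

    from transformers import AutoModelForCausalLM   # assumes the HF `transformers` package
    from model import ModelArgs, Transformer        # this repo's model.py

    def load_hf_into_transformer(hf_path):
        """ hypothetical import utility: build a model.py Transformer from an HF Llama checkpoint """
        hf = AutoModelForCausalLM.from_pretrained(hf_path)
        cfg = hf.config
        args = ModelArgs(dim=cfg.hidden_size, n_layers=cfg.num_hidden_layers,
                         n_heads=cfg.num_attention_heads, n_kv_heads=cfg.num_key_value_heads,
                         vocab_size=cfg.vocab_size, max_seq_len=cfg.max_position_embeddings)
        model = Transformer(args)
        # remaining work (omitted): rename each HF key, e.g.
        #   model.layers.0.self_attn.q_proj.weight -> layers.0.attention.wq.weight,
        # undo the wq/wk permutation HF applies for its RoPE layout, load the renamed
        # state dict into `model`, and then hand the model to model_export() as usual.
        return model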

@karpathy (Owner, Author)

I think I am going to merge this branch to master because the previous functionality around the old model.bin files (now termed "version 0") works without issues. There is a bunch of new code for the version 1 and 2 exports that is currently unused, but it can be tweaked in further PRs, and we can slowly migrate the run files (including run.c) to use it instead. Eventually, when we're happy with v1, v2, ..., we'll deprecate v0 and delete the legacy code.

@karpathy karpathy marked this pull request as ready for review August 20, 2023 17:06
@karpathy karpathy merged commit 8c93c7a into master Aug 20, 2023
6 checks passed
""" writes one int8 tensor to file that is open in wb mode """
d = tensor.detach().cpu().view(-1).numpy().astype(np.int8)
b = struct.pack(f'{len(d)}b', *d)
file.write(b)
A Contributor commented on the lines above:

Copied from #312 (comment)

Pad with zeroes to align on a 64-byte boundary (128 would be better, see #312 (comment)); maybe 256 (the current alignment of the first weights after the header) would be even better. This is a sketch. The same thing should be done in serialize_fp32() above and in the C files.

    file.write(b'\0' * (-file.tell() % 64) )
    file.write(b)
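Putting that review suggestion together with the serialize_int8 body above, a padded writer could look like the following. This is just a sketch: the function name, the 64-byte default, and the align parameter are assumptions, and the C-side readers would have to skip the same padding:

    import struct
    import numpy as np

    def serialize_int8_aligned(file, tensor, align=64):
        """ hypothetical variant: zero-pad to the next `align`-byte boundary, then write the int8 tensor """
        file.write(b'\0' * (-file.tell() % align))   # no-op when the position is already aligned
        d = tensor.detach().cpu().view(-1).numpy().astype(np.int8)
        file.write(struct.pack(f'{len(d)}b', *d))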

@kroggen (Contributor)

kroggen commented Aug 22, 2023

> @kroggen I think you might want to include your own export here, since it only quantizes at the level of tensors?

I'm sorry, I am not well versed in Python/numpy/pytorch

@karpathy karpathy deleted the feature/export branch August 24, 2023 03:16
vinhtran2611 pushed a commit to vinhtran2611/llama2.c that referenced this pull request Jan 20, 2024
New model export (the code remains "dead" and legacy version is still the default behavior, so no breaking changes are introduced). The major benefit is a new export.py file, which we can use to centralize work on formatting: both imports and exports.