convert : ability to lazy-load safetensors remotely without downloading to disk #12820
Conversation
I added `python convert_hf_to_gguf.py --remote ngxson/TEST-Tiny-Llama4` (output file: `ngxson-TEST-Tiny-Llama4-f16.gguf`). The tokenizer and config files will be downloaded to the HF cache directory. For now, since the safetensors data is not loaded, the command above will produce a GGUF with 0 tensors.
gguf-py/gguf/utility.py
Outdated
offset_start_relative, offset_end_relative = meta["data_offsets"]
size = offset_end_relative - offset_start_relative
offset_start = data_start_offset + offset_start_relative
res[name] = (dtype, shape, offset_start, size)
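For context, here is a minimal sketch of how entries like the ones above could be derived from a remote safetensors file, assuming the server honors HTTP Range requests; the function name and return layout are illustrative, not necessarily what `SafetensorRemote` ends up doing:

```python
import json
import struct

import requests

def parse_remote_safetensors_header(url: str) -> dict[str, tuple[str, list[int], int, int]]:
    # A safetensors file starts with an 8-byte little-endian length,
    # followed by that many bytes of JSON metadata; tensor data comes after.
    head = requests.get(url, headers={"Range": "bytes=0-7"}).content
    (header_len,) = struct.unpack("<Q", head)

    raw = requests.get(url, headers={"Range": f"bytes=8-{8 + header_len - 1}"}).content
    meta = json.loads(raw)

    data_start_offset = 8 + header_len
    res: dict[str, tuple[str, list[int], int, int]] = {}
    for name, info in meta.items():
        if name == "__metadata__":
            continue
        offset_start_relative, offset_end_relative = info["data_offsets"]
        size = offset_end_relative - offset_start_relative
        offset_start = data_start_offset + offset_start_relative
        res[name] = (info["dtype"], info["shape"], offset_start, size)
    return res
```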
Nice, this should have all information needed.

A `dataclass` (or even a `NamedTuple`) for remote tensors would be useful, since there will also need to be a function to turn that into either a Numpy `ndarray` or a PyTorch `Tensor`, whichever is simpler at first.

A lazy tensor is built from metadata and a function which produces the tensor; that function is called only when the data is needed (and only once per tensor).

With such a function, it should be simpler to add a `from_remote_tensor` method to `LazyTorchTensor`. Although, to map the safetensors types into PyTorch types, it could be simpler to let that function live in `LazyTorchTensor`, and only expose a `dataclass` or `NamedTuple` for remote tensors, letting `LazyTorchTensor.from_remote_tensor` handle the rest.
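To make the suggestion concrete, a rough sketch of what that could look like, assuming a `RemoteTensor` dataclass with a `data()` method that fetches the tensor's byte range; the names, the dtype map, and the eager `torch.frombuffer` conversion are illustrative (the real `from_remote_tensor` would wrap this in a lazy tensor so the fetch only happens when the data is actually needed):

```python
from dataclasses import dataclass

import requests
import torch

@dataclass
class RemoteTensor:
    dtype: str              # safetensors dtype string, e.g. "F16", "BF16", "F32"
    shape: tuple[int, ...]
    offset_start: int       # absolute byte offset of the tensor data in the remote file
    size: int               # size of the tensor data in bytes
    url: str

    def data(self) -> bytes:
        # fetch exactly this tensor's bytes with an HTTP Range request
        end = self.offset_start + self.size - 1
        headers = {"Range": f"bytes={self.offset_start}-{end}"}
        return requests.get(self.url, allow_redirects=True, headers=headers).content

# hypothetical mapping from safetensors dtypes to torch dtypes
_dtype_map = {
    "F32": torch.float32,
    "F16": torch.float16,
    "BF16": torch.bfloat16,
}

def from_remote_tensor(remote: RemoteTensor) -> torch.Tensor:
    # torch.frombuffer wants a writable buffer, hence the bytearray copy
    data = bytearray(remote.data())
    return torch.frombuffer(data, dtype=_dtype_map[remote.dtype]).reshape(remote.shape)
```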
Nice, thanks for the confirmation. Could you please go ahead and implement the `from_remote_tensor`? Feel free to push directly to this PR, thanks!
> Could you please go ahead and implement the `from_remote_tensor`? Feel free to push directly to this PR, thanks!

I will, once I get somewhere more convenient (currently commuting in public transit).
It's a bit slow for now since everything is blocking and single-threaded.
@ngxson I've tested this with https://huggingface.co/SpectraSuite/FloatLM_99M by using

`$ python3 convert_hf_to_gguf.py SpectraSuite/FloatLM_99M --remote --outfile /path/to/somewhere/FloatLM-99M-remote-{FTYPE}.gguf`

It's a bit slow to convert (even if it's a small model) because the convert script is single-threaded and blocking (it waits when writing a tensor, when downloading, etc.). The resulting model works (tested with ...).
I think it's not a very big concern right now; I imagine it should take about the same time as first downloading the model locally and then running the conversion. Also, I think we can simply add a thread pool.
OK, I'm looking into this right now.
I tested with ...

@bartowski1182 Do you want to give it a try with a bigger model?

Edit: I'm testing Llama 4 Maverick on my side: ...
Yes, I've been looking into that. The main bottleneck is indeed ... (And lazy tensors only evaluate their source (once, and cache it) when their data is needed, so parallel writing also equals parallel fetching.)
Nice. There are still some differences in metadata between a remote-converted model and a local one, like ...
The ...: I think it works by taking the last word in the model name, for example: ...
headers["Authorization"] = f"Bearer {os.environ['HF_TOKEN']}" | ||
if size > -1: | ||
headers["Range"] = f"bytes={start}-{start + size}" | ||
response = requests.get(url, allow_redirects=True, headers=headers) |
Another idea could be to make this LOC multithreaded if the `size` passes a certain threshold, but I'll have a look at this later.
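A hedged sketch of that idea (splitting one large Range request across a thread pool); the threshold, chunk size, and worker count are arbitrary here, and this is not the code from 42fc895:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

CHUNK_SIZE = 32 * 1024 * 1024   # 32 MiB per request, arbitrary
THRESHOLD = 128 * 1024 * 1024   # only split requests above 128 MiB, arbitrary

def get_data_by_range(url: str, start: int, size: int, headers: dict) -> bytes:
    if size <= THRESHOLD:
        h = {**headers, "Range": f"bytes={start}-{start + size - 1}"}
        return requests.get(url, allow_redirects=True, headers=h).content

    def fetch_chunk(chunk_start: int) -> bytes:
        chunk_end = min(chunk_start + CHUNK_SIZE, start + size) - 1
        h = {**headers, "Range": f"bytes={chunk_start}-{chunk_end}"}
        return requests.get(url, allow_redirects=True, headers=h).content

    # map() yields results in submission order, so the chunks concatenate correctly
    with ThreadPoolExecutor(max_workers=8) as pool:
        return b"".join(pool.map(fetch_chunk, range(start, start + size, CHUNK_SIZE)))
```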
Done in 42fc895 (I vibe-coded it with Gemini 2.5 Pro).
Another approach: I think tensor data can be written out of order, then we correct the list of tensors in the metadata, and finally write the metadata to the file. But this may make the list of tensors quite ugly when printing with ...
FYI: when I tried this branch with my locally downloaded Maverick, I got: ...

With the --remote usage it started producing the f16.gguf file, but it's still at 0%.
For the first error, you're running an old version of the script (from before the Llama 4 PR was merged). For the problem of it getting stuck at 0%: yes, it is slow, the first tensor of Maverick is 21GB.
@ngxson No need to correct the list of tensors if the tensors are written in the correct locations, even if out of order. The main problem is ...
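A minimal illustration of that point, with hypothetical names: if each tensor's absolute offset in the output file is known up front (it is, since it's part of the tensor info in the GGUF header), the data can be written in any order and the file comes out identical. Workers could also do this concurrently with `os.pwrite`, which takes an explicit offset instead of sharing a file position.

```python
def write_tensor_data(path: str, tensors: list[tuple[int, bytes]]) -> None:
    # tensors: (absolute_offset_in_file, data) pairs, in arbitrary order;
    # the file is assumed to already exist with its header written
    with open(path, "r+b") as f:
        for offset, data in tensors:
            f.seek(offset)
            f.write(data)
```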
First tensor large, +1.

I successfully converted the model, but we're running out of GPU on HF infra, so unfortunately I can't test it. Will try uploading my version to HF so someone with the right hardware can give it a try.

@ngxson You converted Maverick, right?

@csabakecskemeti I'm running llama-quantize; when it's done, it will be uploaded to https://huggingface.co/ngxson/Llama-4-Maverick-17B-128E-Instruct-Q2_K-GGUF

@ngxson your version is also fine: ...
This reverts commit 42fc895.
gguf-py/gguf/utility.py
Outdated
if not parsed_url.scheme or not parsed_url.netloc:
    raise ValueError(f"Invalid URL: {url}")

headers = {}
I think it can still be useful to specify the User-Agent as in the `_get_request_headers` method from 42fc895.
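For reference, a sketch of what such a helper could look like; the exact User-Agent string is an assumption here, not necessarily what the PR uses:

```python
import os

def _get_request_headers() -> dict[str, str]:
    # identify the client, and pass the HF token (if any) for gated/private repos
    headers = {"User-Agent": "convert_hf_to_gguf"}
    if os.environ.get("HF_TOKEN"):
        headers["Authorization"] = f"Bearer {os.environ['HF_TOKEN']}"
    return headers
```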
Thanks for noticing, done in e8b7d26
@compilade Do you think we can merge this now? I kinda need this to test some new models; it would be nice to have this feature merged even if it's a bit slow. Thanks!
> Do you think we can merge this now? I kinda need this to test some new models, would be nice to have this feature merged even if it's a bit slow. Thanks!
Right, since this is gated behind a flag explicitly marked "Experimental", I think we can fix the remaining problems in a future PR (likely in #12837).
@ngxson Why is this the case? Using the safetensors inspector on HuggingFace shows that the first tensor is ...
Simple: just look at the file size, the 21GB file only has one single tensor.
@ddh0 Also, I think you are looking at the wrong model. Maverick's first tensor has shape (128, 5120, 16384), which is 10.7B params in total.
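For what it's worth, the arithmetic is consistent (assuming bf16, i.e. 2 bytes per parameter): 128 × 5120 × 16384 = 10,737,418,240 ≈ 10.7B parameters, and 10,737,418,240 × 2 bytes ≈ 21.5 GB, which matches the 21GB file mentioned above.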
…ng to disk (ggml-org#12820)

* gguf util : add SafetensorRemote
* fix style
* convert: add --remote option
* convert : allow using lazy remote tensors

It's a bit slow for now since everything is blocking and single-threaded.

* correct metadata.name
* small style fix
* support HF_TOKEN
* convert : use writeable buffer for remote lazy tensors
* convert : fix flake8 lint regarding lamdba assigment
* multithreaded download
* multithread: print debug
* fix style
* Revert "multithreaded download"

This reverts commit 42fc895.

* bring back _get_request_headers

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Ref comment: #12791 (comment)
@compilade What I was able to do here is to add a class `SafetensorRemote` that has everything you need to read a safetensors file remotely. But I have no idea how to plug this into `LazyTorchTensor`. Could you please have a look? Thanks!