
convert : ability to lazy-load safetensors remotely without downloading to disk #12820


Merged
merged 14 commits into ggml-org:master on Apr 10, 2025

Conversation

ngxson
Collaborator

@ngxson ngxson commented Apr 8, 2025

Ref comment: #12791 (comment)

@compilade What I was able to do here is add a SafetensorRemote class that has everything you need to read safetensors files remotely.

tensors = SafetensorRemote.get_list_tensors_hf_model(model_id)
for name, meta in tensors.items():
    dtype, shape, offset_start, size, remote_safetensor_url = meta
    # read the tensor data
    data = SafetensorRemote.get_data_by_range(remote_safetensor_url, offset_start, size)
    print(data)

But I have no idea how to plug this into LazyTorchTensor. Could you please have a look? Thanks!
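For reference, a rough sketch of how the safetensors header can be read remotely with HTTP range requests, relying only on the documented safetensors layout and plain requests calls (the function name and error handling here are illustrative, not the PR's exact SafetensorRemote code):

import json
import struct
import requests

def get_remote_safetensors_header(url: str) -> tuple[dict, int]:
    # The first 8 bytes of a .safetensors file are a little-endian u64 giving the
    # length of the JSON header that follows.
    r = requests.get(url, headers={"Range": "bytes=0-7"}, allow_redirects=True)
    r.raise_for_status()
    header_len = struct.unpack("<Q", r.content[:8])[0]
    # The JSON header maps tensor names to dtype, shape and data_offsets
    # (offsets are relative to the start of the data section).
    r = requests.get(url, headers={"Range": f"bytes=8-{8 + header_len - 1}"}, allow_redirects=True)
    r.raise_for_status()
    header = json.loads(r.content)
    data_start_offset = 8 + header_len
    return header, data_start_offset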

@github-actions github-actions bot added the python python script changes label Apr 8, 2025
@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

I added a --remote argument which allows specifying the HF model ID as the input path, something like this:

python convert_hf_to_gguf.py --remote ngxson/TEST-Tiny-Llama4
# output file: ngxson-TEST-Tiny-Llama4-f16.gguf

The tokenizer and config files will be downloaded to the HF cache directory.

For now, since the safetensors data is not loaded, the command above will produce a GGUF with 0 tensors.
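A minimal sketch of how that tokenizer/config download could be done, assuming huggingface_hub's snapshot_download with an allow_patterns filter (the patterns below are an assumption, not necessarily what the script uses):

from huggingface_hub import snapshot_download

# Fetch only the small config/tokenizer files; the *.safetensors files stay on the Hub.
local_dir = snapshot_download(
    repo_id="ngxson/TEST-Tiny-Llama4",
    allow_patterns=["*.json", "*.txt", "*.model"],  # assumed filter
)
print(local_dir)  # the conversion script then reads config.json / tokenizer files from here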

offset_start_relative, offset_end_relative = meta["data_offsets"]
size = offset_end_relative - offset_start_relative
offset_start = data_start_offset + offset_start_relative
res[name] = (dtype, shape, offset_start, size)
Collaborator

@compilade compilade Apr 8, 2025

Nice, this should have all the information needed.

A dataclass (or even a NamedTuple) for remote tensors would be useful, since there will also need to be a function to turn it into either a NumPy ndarray or a PyTorch Tensor, whichever is simpler at first.

A lazy tensor is built from metadata and a function that produces the tensor; that function is called only when the data is needed (and only once per tensor).

With such a function, it should be simpler to add a from_remote_tensor method to LazyTorchTensor. Since the safetensors types need to be mapped to PyTorch types anyway, it could be simpler to let that function live in LazyTorchTensor: only expose a dataclass or NamedTuple for remote tensors and let LazyTorchTensor.from_remote_tensor handle the rest.
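As an illustration only (not the final code), the deferred materialization could look roughly like this, assuming a NamedTuple for remote tensors and the SafetensorRemote.get_data_by_range helper described above:

from typing import NamedTuple
import torch
from gguf.utility import SafetensorRemote  # class added in this PR (import path assumed)

class RemoteTensor(NamedTuple):
    dtype: str               # safetensors dtype name, e.g. "F32", "F16", "BF16"
    shape: tuple[int, ...]
    offset_start: int
    size: int
    url: str

_DTYPE_MAP = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16}  # abbreviated

def materialize(rt: RemoteTensor) -> torch.Tensor:
    # Called only when the tensor data is actually needed (once per tensor).
    raw = SafetensorRemote.get_data_by_range(rt.url, rt.offset_start, rt.size)
    # bytearray() copies into a writable buffer so torch.frombuffer does not warn
    # about non-writable memory.
    return torch.frombuffer(bytearray(raw), dtype=_DTYPE_MAP[rt.dtype]).reshape(rt.shape)

# LazyTorchTensor.from_remote_tensor would then build a meta tensor with the mapped
# dtype/shape and pass a function like `materialize` as the deferred evaluator.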

Collaborator Author

Nice, thanks for the confirmation. Could you please go ahead and implement the from_remote_tensor? Feel free to push directly to this PR, thanks!

Collaborator

Could you please go ahead and implement the from_remote_tensor? Feel free to push directly to this PR, thanks!

I will, once I get somewhere more convenient (currently commuting in public transit).

It's a bit slow for now since everything is blocking and single-threaded.
@compilade
Collaborator

compilade commented Apr 8, 2025

@ngxson I've tested this with https://huggingface.co/SpectraSuite/FloatLM_99M by using

$ python3 convert_hf_to_gguf.py SpectraSuite/FloatLM_99M --remote --outfile /path/to/somewhere/FloatLM-99M-remote-{FTYPE}.gguf

It's a bit slow to convert (even if it's a small model) because the convert script is single-threaded and is blocking (waits when writing a tensor, when downloading, etc.).

The resulting model works (tested with llama-cli), but its metadata (mostly the name (in the general.name field)) is a bit weird; it seems to use some hash of the local directory used by huggingface_hub when downloading the tokenizer and config files.

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

It's a bit slow to convert (even if it's a small model) because the convert script is single-threaded and is blocking (waits when writing a tensor, when downloading, etc.).

I think it's not a very big concern right now; I imagine it should take about the same time as first downloading the model locally and then running the conversion.

Also, I think we can simply add a thread pool to write_tensors_to_file, right? (We cannot make modify_tensors multi-threaded because for some models we expect to process the tensors in the correct order.)

The resulting model works (tested with llama-cli), but its metadata (mostly the name (in the general.name field)) is a bit weird; it seems to use some hash of the local directory used by huggingface_hub when downloading the tokenizer and config files.

OK, I'm looking into this right now.

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

general.name should now be set to the HF model ID

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

I tested with HuggingFaceTB/SmolLM2-1.7B-Instruct and it works fine! Metadata looks good too:

    4 |     1 | general.architecture = "llama"                                                       
    5 |     1 | general.type = "model"                                                               
    6 |     1 | general.name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"                                 
    7 |     1 | general.finetune = "57aa3c6599c53705406c648e7acca7e11dc45ea3"                        
    8 |     1 | general.size_label = "1.7B"                                                          
    9 |     1 | general.license = "apache-2.0"                                                       
   10 |     1 | general.base_model.count = 1                                                         
   11 |     1 | general.base_model.0.name = "SmolLM2 1.7B"                                           
   12 |     1 | general.base_model.0.organization = "HuggingFaceTB"                                  
   13 |     1 | general.base_model.0.repo_url = "https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B...
   14 |     4 | general.tags = ["safetensors","onnx","transformers.js","text-gene...                 
   15 |     1 | general.languages = ["en"]                                                           
   16 |     1 | llama.block_count = 24     

@bartowski1182 Do you want to give it a try with a bigger model?


Edit: I'm testing Llama 4 Maverick on my side:

python convert_hf_to_gguf.py --remote --outtype q8_0 meta-llama/Llama-4-Maverick-17B-128E-Instruct

@ngxson ngxson marked this pull request as ready for review April 8, 2025 15:04
@ngxson ngxson requested a review from compilade April 8, 2025 15:05
@compilade
Collaborator

compilade commented Apr 8, 2025

Also, I think we can simply add a thread pool to write_tensors_to_file, right? (We cannot make modify_tensors multi-threaded because for some models we expect to process the tensors in the correct order.)

Yes, I've been looking into that. modify_tensors is quite fast anyway with lazy tensors, so it doesn't need a thread pool (it all happens when the list of tensors and shapes is logged).

The main bottleneck is indeed write_tensors_to_file, and yes, that's where a thread pool would be useful. I've been trying to figure out some thread-safe way to write multiple tensors in parallel, but first I think I'll try to at least put a queue of .tofile() in a separate thread and maybe also some prefetching.

(And lazy tensors only evaluate their source (once, caching it) when their data is needed, so parallel writing also means parallel fetching.)
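For illustration, a minimal sketch of the queue-of-.tofile() idea (names and structure here are hypothetical, not the eventual implementation); the bounded queue gives a little prefetching, since the main thread can evaluate the next lazy tensor while the worker writes the previous one:

import queue
import threading

def start_writer_thread(fout, maxsize: int = 2):
    q: queue.Queue = queue.Queue(maxsize=maxsize)  # small bound = limited prefetch

    def worker():
        while True:
            arr = q.get()
            if arr is None:      # sentinel: no more tensors
                break
            arr.tofile(fout)     # sequential append, as the GGUF writer expects
            q.task_done()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t

# usage sketch:
#   q, t = start_writer_thread(fout)
#   for lazy_tensor in tensors:
#       q.put(evaluate(lazy_tensor))  # evaluation (and the remote fetch) happens here
#   q.put(None); t.join()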

general.name should now be set to the HF model ID

Nice. There are still some differences in metadata between a remote-converted model and a local one, like general.finetune, which for some reason is present for a remote FloatLM_99M but not a local one. I have not investigated why yet.

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

Nice. There are still some differences in metadata between a remote-converted model and a local one, like general.finetune, which for some reason is present for a remote FloatLM_99M but not a local one. I have not investigated why yet.

general.finetune is a nice-to-have and I think no one is actually using it, so it's probably fine to fix it in another PR.

I think it works by taking the last word in the model name; for example, for ABC-DEF-XYZ we take the XYZ part. It's handled by the code around name_parts: list[str] = model_full_name_component.split('-') in metadata.py.
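A toy illustration of that splitting behaviour (simplified; the real heuristic in metadata.py is more involved):

model_full_name_component = "ABC-DEF-XYZ"
name_parts: list[str] = model_full_name_component.split('-')
print(name_parts[-1])  # "XYZ" -> candidate for general.finetune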

headers["Authorization"] = f"Bearer {os.environ['HF_TOKEN']}"
if size > -1:
headers["Range"] = f"bytes={start}-{start + size}"
response = requests.get(url, allow_redirects=True, headers=headers)
Collaborator Author

Another idea could be to make this LOC multithreaded if the size passes a certain threshold, but I'll have a look at this later.

Collaborator Author

Done in 42fc895

(I vibe-coded this with Gemini 2.5 Pro.)
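For context, the idea is roughly the following (a sketch only; the actual contents of 42fc895 are not shown here and that commit was later reverted, and the chunk size and worker count are arbitrary):

from concurrent.futures import ThreadPoolExecutor
import requests

def get_data_by_range_parallel(url: str, start: int, size: int,
                               chunk_size: int = 16 * 1024 * 1024, workers: int = 4) -> bytes:
    def fetch(chunk_start: int, chunk_len: int) -> bytes:
        headers = {"Range": f"bytes={chunk_start}-{chunk_start + chunk_len - 1}"}
        r = requests.get(url, headers=headers, allow_redirects=True)
        r.raise_for_status()
        return r.content

    # split [start, start + size) into fixed-size chunks
    ranges = []
    offset = start
    while offset < start + size:
        n = min(chunk_size, start + size - offset)
        ranges.append((offset, n))
        offset += n

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda args: fetch(*args), ranges))
    return b"".join(parts)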

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

The main bottleneck is indeed write_tensors_to_file, and yes, that's where a thread pool would be useful. I've been trying to figure out some thread-safe way to write multiple tensors in parallel, but first I think I'll try to at least put a queue of .tofile() in a separate thread and maybe also some prefetching.

Another approach: I think tensor data can be written out-of-order, then we will correct the list of tensors in metadata, then finally write the metadata to file.

But this may make the list of tensors quite ugly when printing with gguf-dump

@csabakecskemeti
Contributor

FYI: when I tried this branch with my locally downloaded Maverick, I got:

INFO:hf-to-gguf:gguf: expert count = 128
INFO:hf-to-gguf:gguf: experts used count = 1
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
Traceback (most recent call last):
  File "/home/kecso/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1113, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/kecso/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 815, in __getitem__
    raise KeyError(key)
KeyError: 'llama4'

With --remote it has started producing the f16.gguf file, but it's still at 0%:

INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/media/kecso/ssd_4t_storage/meta-llama.Llama-4-Maverick-17B-128E-Instruct.f16.gguf: n_tensors = 531, total_size = 801.5G
Writing:   0%|                                                                                                                                            | 0.00/801G [00:00<?, ?byte/s]/home/kecso/Documents/workspace/llama.cpp/convert_hf_to_gguf.py:5419: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:1521.)
  lazy = cls(meta=meta, args=(remote_tensor,), func=lambda r: torch.frombuffer(r.data(), dtype=dtype).reshape(shape))

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

For the first error, you're running an old version of the script (from before the Llama 4 PR was merged).

As for it being stuck at 0%: yes, it is slow; the first tensor of Maverick is 21GB.

@compilade
Collaborator

Another approach: I think tensor data can be written out-of-order, then we will correct the list of tensors in metadata, then finally write the metadata to file.

@ngxson No need to correct the list of tensors if the tensors are written in the correct locations, even if out of order. The main problem is .tofile() on NumPy arrays, which appends at the end of the file. This can't really be used out of order, so some other way to write the file is needed to allow true parallelism.
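For illustration, one way to do positional (out-of-order) writes would be os.pwrite at precomputed absolute offsets; a sketch under the assumption that the offsets are already known from the GGUF tensor-info section (os.pwrite is POSIX-only, and this is not how GGUFWriter currently works):

import os
from concurrent.futures import ThreadPoolExecutor

def write_tensors_at_offsets(path: str, jobs: list[tuple[int, bytes]], workers: int = 4) -> None:
    # os.pwrite writes at an absolute offset without moving a shared file cursor,
    # so multiple threads can safely write disjoint regions of the same fd.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for offset, data in jobs:
                pool.submit(os.pwrite, fd, data, offset)
    finally:
        os.close(fd)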

@csabakecskemeti
Contributor

First tensor large +1 — confirmed, it has started producing.
(For the first error: I had merged this onto the latest master... anyway, I'll look into the conv..py later when I have time again :P)

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

OK, the multithreaded download cut the conversion time of Maverick in half; the current ETA is 3 hours.

[screenshot: conversion progress]

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

I successfully converted the model, but we're running out of GPUs on the HF infra, so unfortunately I can't test it. I'll try uploading my version to HF so someone with the right hardware can give it a try.

@csabakecskemeti
Contributor

csabakecskemeti commented Apr 8, 2025

@ngxson You converted Maverick, right?
If there's a quant (Q2 maybe?) that fits in ~100GB, I can test it. Send me the link when it's up.
(I'm also converting, but it's going very slowly.)

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

@csabakecskemeti I'm running llama-quantize; when it's done, it will be uploaded to https://huggingface.co/ngxson/Llama-4-Maverick-17B-128E-Instruct-Q2_K-GGUF

@csabakecskemeti
Contributor

csabakecskemeti commented Apr 9, 2025

Works! I tested my version (it was faster this time):
Screenshot from 2025-04-08 19-34-12

@csabakecskemeti
Contributor

@ngxson your version is also fine:
Screenshot from 2025-04-08 19-46-34

if not parsed_url.scheme or not parsed_url.netloc:
    raise ValueError(f"Invalid URL: {url}")

headers = {}
Collaborator

I think it can still be useful to specify the User-Agent as in the _get_request_headers method from 42fc895

Collaborator Author

Thanks for noticing, done in e8b7d26

@ngxson
Collaborator Author

ngxson commented Apr 10, 2025

@compilade Do you think we can merge this now? I kind of need this to test some new models, and it would be nice to have this feature merged even if it's a bit slow. Thanks!

Collaborator

@compilade compilade left a comment

Do you think we can merge this now? I kind of need this to test some new models, and it would be nice to have this feature merged even if it's a bit slow. Thanks!

Right, since this is gated behind a flag explicitly marked "Experimental", I think we can fix the remaining problems in a future PR (likely in #12837).

@ngxson ngxson merged commit 64eda5d into ggml-org:master Apr 10, 2025
5 checks passed
@ddh0
Contributor

ddh0 commented Apr 11, 2025

As for it being stuck at 0%: yes, it is slow; the first tensor of Maverick is 21GB.

@ngxson Why is this the case? The safetensors inspector on Hugging Face shows that the first tensor is language_model.model.embed_tokens.weight with dimensions [202048, 5120] stored in BF16. 1,034,485,760 parameters at BF16 should be about 2GB, right?

@ngxson
Collaborator Author

ngxson commented Apr 11, 2025

Simple: just look at the file size; the 21GB file contains only a single tensor.

@ngxson
Collaborator Author

ngxson commented Apr 11, 2025

@ddh0 Also, I think you are looking at the wrong model. Maverick's first tensor has a shape of (128, 5120, 16384), which is 10.7B params in total.
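A quick size check of the two tensors being discussed (BF16 is 2 bytes per parameter):

embed_params = 202048 * 5120        # ~1.03e9 params -> ~2.1 GB in BF16
expert_params = 128 * 5120 * 16384  # ~1.07e10 params -> ~21.5 GB in BF16
print(embed_params * 2 / 1e9)       # 2.07 (GB)
print(expert_params * 2 / 1e9)      # 21.47 (GB)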

@ddh0
Contributor

ddh0 commented Apr 11, 2025

I see what you mean now by looking at the first shard directly.

But if you go to the model card and hit the "inspect" button (pictured), it shows all the tensors in the model, and the one I mentioned is first. That was the source of the confusion:
Screenshot 2025-04-11 at 1 50 02 AM

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Apr 11, 2025
…ng to disk (ggml-org#12820)

* gguf util : add SafetensorRemote

* fix style

* convert: add --remote option

* convert : allow using lazy remote tensors

It's a bit slow for now since everything is blocking and single-threaded.

* correct metadata.name

* small style fix

* support HF_TOKEN

* convert : use writeable buffer for remote lazy tensors

* convert : fix flake8 lint regarding lamdba assigment

* multithreaded download

* multithread: print debug

* fix style

* Revert "multithreaded download"

This reverts commit 42fc895.

* bring back _get_request_headers

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Apr 12, 2025
…ng to disk (ggml-org#12820)
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
…ng to disk (ggml-org#12820)
timwu pushed a commit to timwu/llama.cpp that referenced this pull request May 5, 2025
…ng to disk (ggml-org#12820)