
convert : ability to lazy-load safetensors remotely without downloading to disk #12820


Merged
merged 14 commits into ggml-org:master on Apr 10, 2025

Conversation

ngxson
Collaborator

@ngxson ngxson commented Apr 8, 2025

Ref comment: #12791 (comment)

@compilade What I was able to do here is add a SafetensorRemote class that has everything you need to read safetensors files remotely.

tensors = SafetensorRemote.get_list_tensors_hf_model(model_id)
for name, meta in tensors.items():
    dtype, shape, offset_start, size, remote_safetensor_url = meta
    # read the tensor data
    data = SafetensorRemote.get_data_by_range(remote_safetensor_url, offset_start, size)
    print(data)

But I have no idea how to plug this into LazyTorchTensor. Could you please have a look? Thanks!
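For reference, a rough sketch of how the safetensors header can be read remotely with HTTP range requests, relying only on the documented safetensors layout and plain requests calls (the function name and error handling here are illustrative, not the PR's exact SafetensorRemote code):

import json
import struct
import requests

def get_remote_safetensors_header(url: str) -> tuple[dict, int]:
    # The first 8 bytes of a .safetensors file are a little-endian u64 giving the
    # length of the JSON header that follows.
    r = requests.get(url, headers={"Range": "bytes=0-7"}, allow_redirects=True)
    r.raise_for_status()
    header_len = struct.unpack("<Q", r.content[:8])[0]
    # The JSON header maps tensor names to dtype, shape and data_offsets
    # (offsets are relative to the start of the data section).
    r = requests.get(url, headers={"Range": f"bytes=8-{8 + header_len - 1}"}, allow_redirects=True)
    r.raise_for_status()
    header = json.loads(r.content)
    data_start_offset = 8 + header_len
    return header, data_start_offset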

@github-actions github-actions bot added the python python script changes label Apr 8, 2025
@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

I added a --remote argument which allows specifying the HF model ID as the input path, something like this:

python convert_hf_to_gguf.py --remote ngxson/TEST-Tiny-Llama4
# output file: ngxson-TEST-Tiny-Llama4-f16.gguf

The tokenizer and config files will be downloaded to the HF cache directory.

For now, since the safetensors data is not loaded, the command above will produce a GGUF with 0 tensors.
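A minimal sketch of how that tokenizer/config download could be done, assuming huggingface_hub's snapshot_download with an allow_patterns filter (the patterns below are an assumption, not necessarily what the script uses):

from huggingface_hub import snapshot_download

# Fetch only the small config/tokenizer files; the *.safetensors files stay on the Hub.
local_dir = snapshot_download(
    repo_id="ngxson/TEST-Tiny-Llama4",
    allow_patterns=["*.json", "*.txt", "*.model"],  # assumed filter
)
print(local_dir)  # the conversion script then reads config.json / tokenizer files from here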

offset_start_relative, offset_end_relative = meta["data_offsets"]
size = offset_end_relative - offset_start_relative
offset_start = data_start_offset + offset_start_relative
res[name] = (dtype, shape, offset_start, size)
Collaborator

@compilade compilade Apr 8, 2025

Nice, this should have all the information needed.

A dataclass (or even a NamedTuple) for remote tensors would be useful, since there will also need to be a function to turn it into either a NumPy ndarray or a PyTorch Tensor, whichever is simpler at first.

A lazy tensor is built from metadata and a function that produces the tensor; that function is called only when the data is needed (and only once per tensor).

With such a function, it should be simpler to add a from_remote_tensor method to LazyTorchTensor. Since the safetensors types need to be mapped to PyTorch types anyway, it could be simpler to let that function live in LazyTorchTensor: only expose a dataclass or NamedTuple for remote tensors and let LazyTorchTensor.from_remote_tensor handle the rest.
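As an illustration only (not the final code), the deferred materialization could look roughly like this, assuming a NamedTuple for remote tensors and the SafetensorRemote.get_data_by_range helper described above:

from typing import NamedTuple
import torch
from gguf.utility import SafetensorRemote  # class added in this PR (import path assumed)

class RemoteTensor(NamedTuple):
    dtype: str               # safetensors dtype name, e.g. "F32", "F16", "BF16"
    shape: tuple[int, ...]
    offset_start: int
    size: int
    url: str

_DTYPE_MAP = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16}  # abbreviated

def materialize(rt: RemoteTensor) -> torch.Tensor:
    # Called only when the tensor data is actually needed (once per tensor).
    raw = SafetensorRemote.get_data_by_range(rt.url, rt.offset_start, rt.size)
    # bytearray() copies into a writable buffer so torch.frombuffer does not warn
    # about non-writable memory.
    return torch.frombuffer(bytearray(raw), dtype=_DTYPE_MAP[rt.dtype]).reshape(rt.shape)

# LazyTorchTensor.from_remote_tensor would then build a meta tensor with the mapped
# dtype/shape and pass a function like `materialize` as the deferred evaluator.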

Collaborator Author

Nice, thanks for the confirmation. Could you please go ahead and implement the from_remote_tensor? Feel free to push directly to this PR, thanks!

Collaborator

Could you please go ahead and implement the from_remote_tensor? Feel free to push directly to this PR, thanks!

I will, once I get somewhere more convenient (currently commuting in public transit).

It's a bit slow for now since everything is blocking and single-threaded.
@compilade
Collaborator

compilade commented Apr 8, 2025

@ngxson I've tested this with https://huggingface.co/SpectraSuite/FloatLM_99M by using

$ python3 convert_hf_to_gguf.py SpectraSuite/FloatLM_99M --remote --outfile /path/to/somewhere/FloatLM-99M-remote-{FTYPE}.gguf

It's a bit slow to convert (even if it's a small model) because the convert script is single-threaded and is blocking (waits when writing a tensor, when downloading, etc.).

The resulting model works (tested with llama-cli), but its metadata (mostly the name (in the general.name field)) is a bit weird; it seems to use some hash of the local directory used by huggingface_hub when downloading the tokenizer and config files.

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

It's a bit slow to convert (even if it's a small model) because the convert script is single-threaded and is blocking (waits when writing a tensor, when downloading, etc.).

I think it's not a very big concern right now; I imagine it should take about the same time as first downloading the model locally and then running the conversion.

Also, I think we can simply add a thread pool to write_tensors_to_file, right? (We cannot make modify_tensors multi-threaded because for some models we expect to process the tensors in the correct order.)

The resulting model works (tested with llama-cli), but its metadata (mostly the name (in the general.name field)) is a bit weird; it seems to use some hash of the local directory used by huggingface_hub when downloading the tokenizer and config files.

OK, I'm looking into this right now.

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

general.name should now be set to the HF model ID

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

I tested with HuggingFaceTB/SmolLM2-1.7B-Instruct and it works fine! Metadata looks good too:

    4 |     1 | general.architecture = "llama"                                                       
    5 |     1 | general.type = "model"                                                               
    6 |     1 | general.name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"                                 
    7 |     1 | general.finetune = "57aa3c6599c53705406c648e7acca7e11dc45ea3"                        
    8 |     1 | general.size_label = "1.7B"                                                          
    9 |     1 | general.license = "apache-2.0"                                                       
   10 |     1 | general.base_model.count = 1                                                         
   11 |     1 | general.base_model.0.name = "SmolLM2 1.7B"                                           
   12 |     1 | general.base_model.0.organization = "HuggingFaceTB"                                  
   13 |     1 | general.base_model.0.repo_url = "https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B...
   14 |     4 | general.tags = ["safetensors","onnx","transformers.js","text-gene...                 
   15 |     1 | general.languages = ["en"]                                                           
   16 |     1 | llama.block_count = 24     

@bartowski1182 Do you want to give it a try with a bigger model?


Edit: I'm testing Llama 4 Maverick on my side:

python convert_hf_to_gguf.py --remote --outtype q8_0 meta-llama/Llama-4-Maverick-17B-128E-Instruct

@ngxson ngxson marked this pull request as ready for review April 8, 2025 15:04
@ngxson ngxson requested a review from compilade April 8, 2025 15:05
@compilade
Collaborator

compilade commented Apr 8, 2025

Also, I think we can simply add a thread pool to write_tensors_to_file, right? (We cannot make modify_tensors multi-threaded because for some models we expect to process the tensors in the correct order.)

Yes, I've been looking into that. modify_tensors is quite fast anyway with lazy tensors, so it doesn't need a thread pool (it all happens when the list of tensors and shapes is logged).

The main bottleneck is indeed write_tensors_to_file, and yes, that's where a thread pool would be useful. I've been trying to figure out some thread-safe way to write multiple tensors in parallel, but first I think I'll try to at least put a queue of .tofile() in a separate thread and maybe also some prefetching.

(And lazy tensors only evaluate their source (once, caching it) when their data is needed, so parallel writing also means parallel fetching.)
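For illustration, a minimal sketch of the queue-of-.tofile() idea (names and structure here are hypothetical, not the eventual implementation); the bounded queue gives a little prefetching, since the main thread can evaluate the next lazy tensor while the worker writes the previous one:

import queue
import threading

def start_writer_thread(fout, maxsize: int = 2):
    q: queue.Queue = queue.Queue(maxsize=maxsize)  # small bound = limited prefetch

    def worker():
        while True:
            arr = q.get()
            if arr is None:      # sentinel: no more tensors
                break
            arr.tofile(fout)     # sequential append, as the GGUF writer expects
            q.task_done()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t

# usage sketch:
#   q, t = start_writer_thread(fout)
#   for lazy_tensor in tensors:
#       q.put(evaluate(lazy_tensor))  # evaluation (and the remote fetch) happens here
#   q.put(None); t.join()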

general.name should now be set to the HF model ID

Nice. There are still some differences in metadata between a remote-converted model and a local one, like general.finetune, which for some reason is present for a remote FloatLM_99M but not a local one. I have not investigated why yet.

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

Nice. There are still some differences in metadata between a remote-converted model and a local one, like general.finetune, which for some reason is present for a remote FloatLM_99M but not a local one. I have not investigated why yet.

general.finetune is a nice-to-have and I think no one is actually using it, so it's probably fine to fix it in another PR.

I think it works by taking the last word in the model name; for example, for ABC-DEF-XYZ we take the XYZ part. It's handled by the code around name_parts: list[str] = model_full_name_component.split('-') in metadata.py.
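A toy illustration of that splitting behaviour (simplified; the real heuristic in metadata.py is more involved):

model_full_name_component = "ABC-DEF-XYZ"
name_parts: list[str] = model_full_name_component.split('-')
print(name_parts[-1])  # "XYZ" -> candidate for general.finetune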

headers["Authorization"] = f"Bearer {os.environ['HF_TOKEN']}"
if size > -1:
headers["Range"] = f"bytes={start}-{start + size}"
response = requests.get(url, allow_redirects=True, headers=headers)
Collaborator Author

Another idea could be to make this LOC multithreaded if the size passes a certain threshold, but I'll have a look at this later.

Collaborator Author

Done in 42fc895

(I vibe-coded this with Gemini 2.5 Pro.)
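For context, the idea is roughly the following (a sketch only; the actual contents of 42fc895 are not shown here and that commit was later reverted, and the chunk size and worker count are arbitrary):

from concurrent.futures import ThreadPoolExecutor
import requests

def get_data_by_range_parallel(url: str, start: int, size: int,
                               chunk_size: int = 16 * 1024 * 1024, workers: int = 4) -> bytes:
    def fetch(chunk_start: int, chunk_len: int) -> bytes:
        headers = {"Range": f"bytes={chunk_start}-{chunk_start + chunk_len - 1}"}
        r = requests.get(url, headers=headers, allow_redirects=True)
        r.raise_for_status()
        return r.content

    # split [start, start + size) into fixed-size chunks
    ranges = []
    offset = start
    while offset < start + size:
        n = min(chunk_size, start + size - offset)
        ranges.append((offset, n))
        offset += n

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda args: fetch(*args), ranges))
    return b"".join(parts)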

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

The main bottleneck is indeed write_tensors_to_file, and yes, that's where a thread pool would be useful. I've been trying to figure out some thread-safe way to write multiple tensors in parallel, but first I think I'll try to at least put a queue of .tofile() in a separate thread and maybe also some prefetching.

Another approach: I think tensor data can be written out-of-order, then we will correct the list of tensors in metadata, then finally write the metadata to file.

But this may make the list of tensors quite ugly when printing with gguf-dump

@csabakecskemeti
Contributor

FYI: when I tried this branch with my locally downloaded Maverick, I got:

INFO:hf-to-gguf:gguf: expert count = 128
INFO:hf-to-gguf:gguf: experts used count = 1
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
Traceback (most recent call last):
  File "/home/kecso/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1113, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/kecso/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 815, in __getitem__
    raise KeyError(key)
KeyError: 'llama4'

With --remote it has started producing the f16.gguf file, but it's still at 0%:

INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/media/kecso/ssd_4t_storage/meta-llama.Llama-4-Maverick-17B-128E-Instruct.f16.gguf: n_tensors = 531, total_size = 801.5G
Writing:   0%|                                                                                                                                            | 0.00/801G [00:00<?, ?byte/s]/home/kecso/Documents/workspace/llama.cpp/convert_hf_to_gguf.py:5419: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:1521.)
  lazy = cls(meta=meta, args=(remote_tensor,), func=lambda r: torch.frombuffer(r.data(), dtype=dtype).reshape(shape))

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

For the first error, you're running an old version of the script (from before the Llama 4 PR was merged).

As for it being stuck at 0%: yes, it is slow; the first tensor of Maverick is 21GB.

@compilade
Collaborator

Another approach: I think tensor data can be written out-of-order, then we will correct the list of tensors in metadata, then finally write the metadata to file.

@ngxson No need to correct the list of tensors if the tensors are written in the correct locations, even if out of order. The main problem is .tofile() on NumPy arrays, which appends at the end of the file. This can't really be used out of order, so some other way to write the file is needed to allow true parallelism.
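For illustration, one way to do positional (out-of-order) writes would be os.pwrite at precomputed absolute offsets; a sketch under the assumption that the offsets are already known from the GGUF tensor-info section (os.pwrite is POSIX-only, and this is not how GGUFWriter currently works):

import os
from concurrent.futures import ThreadPoolExecutor

def write_tensors_at_offsets(path: str, jobs: list[tuple[int, bytes]], workers: int = 4) -> None:
    # os.pwrite writes at an absolute offset without moving a shared file cursor,
    # so multiple threads can safely write disjoint regions of the same fd.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for offset, data in jobs:
                pool.submit(os.pwrite, fd, data, offset)
    finally:
        os.close(fd)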

@csabakecskemeti
Contributor

First tensor large +1 — confirmed, it has started producing.
(For the first error: I had merged this onto the latest master... anyway, I'll look into the conv..py later when I have time again :P)

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

OK, the multithreaded download cut the conversion time of Maverick in half; the current ETA is 3 hours.

[screenshot: conversion progress]

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

I successfully converted the model, but we're running out of GPUs on the HF infra, so unfortunately I can't test it. I'll try uploading my version to HF so someone with the right hardware can give it a try.

@csabakecskemeti
Contributor

csabakecskemeti commented Apr 8, 2025

@ngxson You converted Maverick, right?
If there's a quant (Q2 maybe?) that fits in ~100GB, I can test it. Send me the link when it's up.
(I'm also converting, but it's going very slowly.)

@ngxson
Collaborator Author

ngxson commented Apr 8, 2025

@csabakecskemeti I'm running llama-quantize; when it's done, it will be uploaded to https://huggingface.co/ngxson/Llama-4-Maverick-17B-128E-Instruct-Q2_K-GGUF

@csabakecskemeti
Contributor

csabakecskemeti commented Apr 9, 2025

Works! I tested my version (it was faster this time):
Screenshot from 2025-04-08 19-34-12

@csabakecskemeti
Contributor

@ngxson your version is also fine:
Screenshot from 2025-04-08 19-46-34

if not parsed_url.scheme or not parsed_url.netloc:
    raise ValueError(f"Invalid URL: {url}")

headers = {}
Collaborator

I think it can still be useful to specify the User-Agent as in the _get_request_headers method from 42fc895

Collaborator Author

Thanks for noticing, done in e8b7d26

@ngxson
Collaborator Author

ngxson commented Apr 10, 2025

@compilade Do you think we can merge this now? I kind of need this to test some new models, and it would be nice to have this feature merged even if it's a bit slow. Thanks!

Collaborator

@compilade compilade left a comment

Do you think we can merge this now? I kind of need this to test some new models, and it would be nice to have this feature merged even if it's a bit slow. Thanks!

Right, since this is gated behind a flag explicitly marked "Experimental", I think we can fix the remaining problems in a future PR (likely in #12837).

@ngxson ngxson merged commit 64eda5d into ggml-org:master Apr 10, 2025
5 checks passed
@ddh0
Contributor

ddh0 commented Apr 11, 2025

As for it being stuck at 0%: yes, it is slow; the first tensor of Maverick is 21GB.

@ngxson Why is this the case? The safetensors inspector on Hugging Face shows that the first tensor is language_model.model.embed_tokens.weight with dimensions [202048, 5120] stored in BF16. 1,034,485,760 parameters at BF16 should be about 2GB, right?

@ngxson
Collaborator Author

ngxson commented Apr 11, 2025

Simple: just look at the file size; the 21GB file contains only a single tensor.

@ngxson
Collaborator Author

ngxson commented Apr 11, 2025

@ddh0 Also, I think you are looking at the wrong model. Maverick's first tensor has a shape of (128, 5120, 16384), which is 10.7B params in total.
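A quick size check of the two tensors being discussed (BF16 is 2 bytes per parameter):

embed_params = 202048 * 5120        # ~1.03e9 params -> ~2.1 GB in BF16
expert_params = 128 * 5120 * 16384  # ~1.07e10 params -> ~21.5 GB in BF16
print(embed_params * 2 / 1e9)       # 2.07 (GB)
print(expert_params * 2 / 1e9)      # 21.47 (GB)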

@ddh0
Contributor

ddh0 commented Apr 11, 2025

I see what you mean now by looking at the first shard directly.

But if you go to the model card and hit the "inspect" button (pictured), it shows all the tensors in the model, and the one I mentioned is first. That was the source of the confusion:
Screenshot 2025-04-11 at 1 50 02 AM

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Apr 11, 2025
…ng to disk (ggml-org#12820)

* gguf util : add SafetensorRemote

* fix style

* convert: add --remote option

* convert : allow using lazy remote tensors

It's a bit slow for now since everything is blocking and single-threaded.

* correct metadata.name

* small style fix

* support HF_TOKEN

* convert : use writeable buffer for remote lazy tensors

* convert : fix flake8 lint regarding lamdba assigment

* multithreaded download

* multithread: print debug

* fix style

* Revert "multithreaded download"

This reverts commit 42fc895.

* bring back _get_request_headers

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Apr 12, 2025
…ng to disk (ggml-org#12820)
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
…ng to disk (ggml-org#12820)
timwu pushed a commit to timwu/llama.cpp that referenced this pull request May 5, 2025
…ng to disk (ggml-org#12820)