
Conversation

@liangel-02 (Contributor)

Context

This PR is a follow-up to #40735 and #41138. Previously, we enabled safetensors support in torchao for a single shard file. This PR fixes some errors introduced in #41138 and handles the case where a checkpoint is sharded across more than one file, including the edge case where a single quantized tensor (e.g. a Float8Tensor) is split across two different files (e.g. qdata in one and scale in another).
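As a minimal sketch of that edge case (the key names _weight_qdata/_weight_scale and the file names below are hypothetical, not the exact torchao flattening scheme), the two components of one quantized weight can land in different shard files, so neither file alone can rebuild the tensor:

from safetensors.torch import load_file, save_file
import torch

# Hypothetical sharded layout: qdata and scale for the same weight are
# saved to two different safetensors files.
save_file({"layer.0._weight_qdata": torch.zeros(4, 4, dtype=torch.int8)}, "model-00001-of-00002.safetensors")
save_file({"layer.0._weight_scale": torch.ones(4)}, "model-00002-of-00002.safetensors")

# Each shard only sees one component of the quantized tensor.
print(load_file("model-00001-of-00002.safetensors").keys())  # qdata only
print(load_file("model-00002-of-00002.safetensors").keys())  # scale only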

Summary

If we are loading a component of a tensor subclass in create_quantized_param(), called by _load_state_dict_into_meta_model(), we add it to the model as a new parameter. Then, after all parameters are loaded, we unflatten the state_dict and reassign the model parameters.
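A minimal sketch of this two-phase flow, with made-up component names (torchao would rebuild a Float8Tensor in phase two; this sketch uses a plain dequantized tensor so it stays runnable):

import torch
import torch.nn as nn

model = nn.Linear(4, 4, bias=False)

# Phase 1: each flattened component arriving from a shard is registered
# as an ordinary parameter on the module.
model.register_parameter("_weight_qdata", nn.Parameter(torch.zeros(4, 4), requires_grad=False))
model.register_parameter("_weight_scale", nn.Parameter(torch.ones(4), requires_grad=False))

# Phase 2: once all shards are loaded, rebuild the real weight, drop the
# temporary component parameters, and reassign the module's weight.
rebuilt = model._weight_qdata.data * model._weight_scale.data
for name in ("_weight_qdata", "_weight_scale"):
    delattr(model, name)
model.weight = nn.Parameter(rebuilt, requires_grad=False)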

Testing

Modified the unit tests to cover all tensor subclasses:
python tests/quantization/torchao_integration/test_torchao.py -k TorchAoSafeSerializationTest

@liangel-02 marked this pull request as draft November 3, 2025 17:43
@liangel-02 force-pushed the torchao-safetensors-sharding branch from 8b6b802 to eeb8451 November 3, 2025 17:54
@liangel-02 marked this pull request as ready for review November 3, 2025 18:32
@github-actions bot requested review from MekkCyber and SunMarc November 3, 2025 18:33
@liangel-02 force-pushed the torchao-safetensors-sharding branch 2 times, most recently from a431b9a to 5a62843 November 3, 2025 21:12
@jerryzh168 (Contributor) left a comment


Thanks, mostly looks good; had one more inline comment.

@liangel-02 force-pushed the torchao-safetensors-sharding branch from 5a62843 to 1a020ed November 3, 2025 21:19
@SunMarc (Member) left a comment


Thanks for your work! Left a couple of comments. By the way, we will soon refactor how quantization is applied as we move to dynamic weight loading like vLLM. This should help with supporting features like TP.

Comment on lines 245 to 268
if TORCHAO_VERSION >= version.parse("0.14.0") and is_metadata_torchao(self.metadata):
    updated_state_dict = unflatten_tensor_state_dict(model.state_dict(), metadata)

    weights_to_register = set(updated_state_dict.keys())

    for name, param in list(model.named_parameters()):
        module_fqn, weight_name = name.rsplit(".", 1)
        module = model.get_submodule(module_fqn)
        weight = getattr(module, weight_name)

        device = weight.device
        requires_grad = weight.requires_grad

        if "_weight_" in weight_name:
            delattr(module, weight_name)

        if name in weights_to_register:
            new_param_value = updated_state_dict[name]
            new_param = torch.nn.Parameter(new_param_value.to(device), requires_grad=requires_grad)
            module.register_parameter(weight_name, new_param)

            weights_to_register.remove(name)

    model.load_state_dict(updated_state_dict, strict=False)
Member

So instead of performing unflatten_tensor_state_dict in create_quantized_param, we do it here at the very end, and we just store the flattened weights in the module?

Contributor Author

Yeah, we don't want to do it in create_quantized_param since we'd have access to at most one shard file there, and we want to handle the case where tensor subclass attributes are split across multiple files.

We call unflatten_tensor_state_dict at the very end to get the recovered state dict, then iterate through the model and replace the weights that represent tensor attributes with the full tensor subclass.
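A toy version of that regrouping step, in the spirit of unflatten_tensor_state_dict (the real function uses the safetensors metadata and reconstructs the tensor subclass; this sketch just groups components under an assumed "<fqn>._weight_<suffix>" key pattern):

import re
import torch

def unflatten_by_suffix(flat_state_dict, suffixes=("qdata", "scale")):
    # Collect "<fqn>._weight_<suffix>" entries back into one entry per
    # weight; anything else passes through untouched. The real torchao
    # code would turn each complete group into e.g. a Float8Tensor.
    pattern = re.compile(r"^(.*)\._weight_(" + "|".join(suffixes) + r")$")
    grouped, passthrough = {}, {}
    for key, tensor in flat_state_dict.items():
        match = pattern.match(key)
        if match:
            fqn, suffix = match.groups()
            grouped.setdefault(fqn + ".weight", {})[suffix] = tensor
        else:
            passthrough[key] = tensor
    return {**passthrough, **grouped}

flat = {
    "model.layers.0.q_proj._weight_qdata": torch.zeros(4, 4, dtype=torch.int8),
    "model.layers.0.q_proj._weight_scale": torch.ones(4),
}
print(unflatten_by_suffix(flat).keys())  # dict_keys(['model.layers.0.q_proj.weight'])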

@liangel-02 force-pushed the torchao-safetensors-sharding branch from 1a020ed to 7cdb0c6 November 4, 2025 15:58
@liangel-02 requested a review from SunMarc November 4, 2025 15:59
@liangel-02 force-pushed the torchao-safetensors-sharding branch 6 times, most recently from 9a309b0 to f1369bd November 5, 2025 23:27
@liangel-02 force-pushed the torchao-safetensors-sharding branch 3 times, most recently from 5ed0aad to 9219962 November 13, 2025 21:23
@github-actions (bot)

[For maintainers] Suggested jobs to run (before merge)

run-slow: torchao_integration

@liangel-02 force-pushed the torchao-safetensors-sharding branch from 9219962 to 1b19193 November 13, 2025 21:26
@liangel-02 (Contributor, Author)

@SunMarc I rebased my PR and am now seeing this error due to #41580:

File "/home/liangel/local/transformers/src/transformers/modeling_utils.py", line 4122, in from_pretrained
    model, missing_keys, unexpected_keys, mismatched_keys, offload_index, error_msgs = cls._load_pretrained_model(
                                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/liangel/local/transformers/src/transformers/modeling_utils.py", line 4275, in _load_pretrained_model
    missing_keys, unexpected_keys, mismatched_keys, misc = convert_and_load_state_dict_in_model(
                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/liangel/local/transformers/src/transformers/core_model_loading.py", line 621, in convert_and_load_state_dict_in_model
    raise ValueError("This quantization method is gonna be supported SOOOON")
ValueError: This quantization method is gonna be supported SOOOON

Will there be follow-up changes to support torchao? What changes would be needed? cc @jerryzh168

@ArthurZucker (Collaborator)

Happy to help if you want to make the changes here. I think @SunMarc and @MekkCyber are going to help as well with making sure torchao is supported!

@SunMarc (Member) commented Nov 14, 2025

We just merged a big PR and all quantization methods are impacted; we will add back support for those methods ASAP!
