Make cache traceable #35873

IlyasMoutawwakil · 2025-01-24T10:56:19Z

What does this PR do?

In #35792 I came to the conclusion that tensor subclassing can currently only be achieved with some restrictions (e.g. get_seq_length() needs to return a tensor), which adds cognitive load for developers adding a cache class. In this PR we make the cache traceable and exportable by no longer making it a Module and by registering the cache tensors as buffers directly in TorchExportableModuleWithStaticCache.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

guangy10 · 2025-01-24T21:45:49Z

src/transformers/integrations/executorch.py

        )
+        for i in range(len(self.static_cache.key_cache)):
+            self.register_buffer(f"key_cache_{i}", self.static_cache.key_cache[i], persistent=False)


Curious why non-persistent is preferred, in your opinion? It probably doesn't matter too much for inference, since it will always start by filling the cache with the prompt tokens, even if the buffers are persistent.

Non-persistent buffers are not saved in the model's state_dict. For a big model with a long-sequence-length static cache, I think the cache tensors would have a non-negligible memory footprint when exporting and saving the model.
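To illustrate the point about state_dict, here is a minimal sketch. CacheHolder is a hypothetical stand-in, not the actual TorchExportableModuleWithStaticCache, and the buffer shape is arbitrary; only register_buffer and its persistent flag are real PyTorch API.

```python
import torch
from torch import nn


class CacheHolder(nn.Module):
    """Hypothetical wrapper that owns per-layer cache tensors as buffers."""

    def __init__(self, num_layers: int = 2, persistent: bool = False):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        for i in range(num_layers):
            # Illustrative shape only: (batch, heads, max_len, head_dim).
            self.register_buffer(
                f"key_cache_{i}", torch.zeros(1, 2, 8, 4), persistent=persistent
            )


# Non-persistent buffers are still attributes of the module...
holder = CacheHolder(persistent=False)
buffer_names = [name for name, _ in holder.named_buffers()]

# ...but they are left out of the state_dict, so they are not saved.
non_persistent_keys = sorted(holder.state_dict().keys())
persistent_keys = sorted(CacheHolder(persistent=True).state_dict().keys())
```

With persistent=False, the state_dict contains only the proj weights, while the cache tensors remain reachable through named_buffers() and regular attribute access.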

HuggingFaceDocBuilderDev · 2025-01-26T20:06:06Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

IlyasMoutawwakil · 2025-01-27T14:17:04Z

Tests in optimum-executorch are passing as well: huggingface/optimum-executorch#4

ArthurZucker

Super cool! IMO this is just missing what it enables (as in, documentation about how to use this now); might that belong in optimum directly?

ArthurZucker

Also I remember we had a need for nn.Module: copy. We need to be able to copy the cache object (or clone) for prefix re-usage

ArthurZucker · 2025-01-27T14:55:21Z

Can you check that https://github.com/huggingface/huggingface-llama-recipes/blob/main/performance_optimization/prompt_reuse.py still works

IlyasMoutawwakil · 2025-01-27T15:47:54Z

> Also I remember we had a need for nn.Module: copy. We need to be able to copy the cache object (or clone) for prefix re-usage

There is a test for that; I saw it run locally: https://github.com/huggingface/transformers/blob/main/tests/utils/test_cache_utils.py#L593C1-L622C60

guangy10

Thank you for validating the changes in Optimum huggingface/optimum-executorch#4, especially considering the inconvenience caused by the disruptions due to the migration to a new repository

gante

LGTM, thank you for double-checking the changes in many places @IlyasMoutawwakil 🤗

ArthurZucker

Let's go if this test ran locally! 🚀

tugsbayasgalan · 2025-02-19T15:46:40Z

Hey guys, I work on the torch.export team at PyTorch and I just wanted to verify: this change only works for the static KV cache, right?

IlyasMoutawwakil · 2025-02-19T15:55:00Z

> Hey guys, I work on the torch.export team at PyTorch and I just wanted to verify: this change only works for the static KV cache, right?

Hey! Not sure I understand: the change applies to the Cache class, which all cache classes inherit from. Specifically for torch.export, it removes the need for Cache to be a subclass of torch.nn.Module.

IlyasMoutawwakil · 2025-02-19T15:55:43Z

Just updated the branch; I will merge tonight after the tests pass, to be sure everything is good.

eljandoubi · 2025-03-17T20:07:03Z

@IlyasMoutawwakil @ArthurZucker @gante @guangy10 detaching Cache from torch.nn.Module removes the .float() method from it, which causes an error when calling convert_to_fp32.

IlyasMoutawwakil · 2025-03-17T22:07:32Z

self.float() should only be called if self is a tensor or a module. Can you explain why it's called in this case?

gante · 2025-03-19T17:23:17Z

I saw the same stack trace in another issue, but the user didn't share a script to reproduce it. @eljandoubi could you kindly share a script to reproduce the issue? 🙏

eljandoubi · 2025-03-20T12:13:03Z

@IlyasMoutawwakil the _is_fp16_bf16_tensor function checks whether an object is a tensor or has a dtype attribute, and whether that dtype is either fp16 or bf16.

eljandoubi · 2025-03-20T12:17:00Z

@gante I'm afraid that I can't, but I can provide guidance. When fine-tuning PaliGemma 2 mix in mixed precision (bf16), the evaluation in the training loop of the HF Trainer is done in fp32, and so convert_to_fp32 is called.

IlyasMoutawwakil · 2025-03-20T12:50:35Z

This sounds like it was a silent error, because calling .float() on a Cache instance (when it was an nn.Module) doesn't actually do anything, since Cache.key_cache / Cache.value_cache are just lists and not module parameters.
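A small sketch of why that call was a no-op. ListCache and BufferCache are hypothetical stand-ins, not the real Cache class; the only PyTorch behavior relied on is that nn.Module.float() casts registered parameters and buffers and nothing else.

```python
import torch
from torch import nn


class ListCache(nn.Module):
    """Hypothetical stand-in for a Module-based cache that keeps tensors in a list."""

    def __init__(self):
        super().__init__()
        # A plain Python list: these tensors are neither parameters nor buffers.
        self.key_cache = [torch.zeros(2, 3, dtype=torch.bfloat16)]


class BufferCache(nn.Module):
    """Hypothetical variant that registers the same tensor as a buffer."""

    def __init__(self):
        super().__init__()
        self.register_buffer("key_cache_0", torch.zeros(2, 3, dtype=torch.bfloat16))


# .float() returns the module itself, so the calls can be chained.
list_dtype = ListCache().float().key_cache[0].dtype
buffer_dtype = BufferCache().float().key_cache_0.dtype
```

Here list_dtype stays torch.bfloat16 while buffer_dtype becomes torch.float32, which matches the description of the old behavior as silently doing nothing.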

In the case of fine-tuning, I believe you need to pass use_cache=False so that the Cache is not returned (and thus no conversion is attempted). We can implement a .float() method that actually converts Cache.key_cache / Cache.value_cache, depending on whether we want to silence this issue or force users to set use_cache=False when training. wdyt @gante?

eljandoubi · 2025-03-20T13:41:43Z

@IlyasMoutawwakil Thanks for enlightening me. I think setting use_cache=False automatically in the Trainer, as is done when using gradient checkpointing, would be great. It would also help to ignore objects that have no .float() method in _is_fp16_bf16_tensor.
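The guard suggested above could look roughly like the sketch below. Everything here is hypothetical and written in plain Python for illustration: FakeTensor, FakeCache, is_half_precision, naive_to_fp32 and guarded_to_fp32 are made-up names, dtypes are modeled as strings, and none of this is the actual accelerate implementation of _is_fp16_bf16_tensor or convert_to_fp32.

```python
class FakeTensor:
    """Hypothetical tensor stand-in: has a dtype and a .float() method."""

    def __init__(self, dtype: str):
        self.dtype = dtype

    def float(self) -> "FakeTensor":
        return FakeTensor("float32")


class FakeCache:
    """Hypothetical cache stand-in: exposes a dtype but has no .float()."""

    dtype = "bfloat16"


def is_half_precision(obj) -> bool:
    # Duck-typed check in the spirit of the description above:
    # anything with a half-precision dtype attribute qualifies.
    return getattr(obj, "dtype", None) in ("float16", "bfloat16")


def naive_to_fp32(obj):
    # Assumes every half-precision object can be cast with .float().
    return obj.float() if is_half_precision(obj) else obj


def guarded_to_fp32(obj):
    # Only cast when the object really provides a callable .float().
    if is_half_precision(obj) and callable(getattr(obj, "float", None)):
        return obj.float()
    return obj


cache = FakeCache()
try:
    naive_to_fp32(cache)
    naive_failed = False
except AttributeError:
    naive_failed = True

passed_through = guarded_to_fp32(cache) is cache
converted_dtype = guarded_to_fp32(FakeTensor("bfloat16")).dtype
```

The naive version raises AttributeError on the cache stand-in, while the guarded version returns it unchanged and still converts objects that do support .float().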

gante · 2025-03-28T17:36:09Z

(see #37044)

simply make cache traceable

67dd552

IlyasMoutawwakil mentioned this pull request Jan 24, 2025

Make Cache a subclass of torch.Tensor #35792

Closed

5 tasks

IlyasMoutawwakil requested a review from gante January 24, 2025 12:38

guangy10 reviewed Jan 24, 2025

Merge branch 'main' into make-cache-traceable

a33dd6e

IlyasMoutawwakil mentioned this pull request Jan 26, 2025

Transformers 4.48 huggingface/optimum#2158

Merged

3 tasks

ArthurZucker reviewed Jan 27, 2025

guangy10 approved these changes Jan 28, 2025

gante approved these changes Jan 29, 2025

ArthurZucker approved these changes Feb 12, 2025

Merge branch 'main' into make-cache-traceable

d1bd324

tugsbayasgalan mentioned this pull request Feb 19, 2025

torch.export.export fails when one input is a class inheriting from torch.nn.Module pytorch/pytorch#147326

Open

IlyasMoutawwakil merged commit 5e2183f into main Feb 20, 2025
24 checks passed

IlyasMoutawwakil deleted the make-cache-traceable branch February 20, 2025 08:59

SunMarc mentioned this pull request Feb 20, 2025

[bugfix] Update modeling_llama.py so it skips keys correctly #36289

Closed

tugsbayasgalan mentioned this pull request Feb 20, 2025

Support tracable dynamicKVcache #36311

Merged

gante mentioned this pull request Mar 4, 2025

[Cache] Don't initialize the cache on meta device #36543

Merged

eljandoubi mentioned this pull request Mar 20, 2025

check that an object has .float() method huggingface/accelerate#3451

Closed

kaixuanliu mentioned this pull request Jul 10, 2025

Avoid registering pytree when using FSDP #39325

Closed
