Fix FA2 for models with HybridCache #35681
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
For gemma2 it was supposed to work!
Not aligned with the removal of the attention mask slicing though! Let's run the slow tests on this PR.
if attention_mask.shape[-1] <= 1:  # when decoding
    attention_mask = attention_mask[:, :, :, -self.sliding_window :]
This is super important! Why is it removed?
I know it is counter-intuitive, but _flash_attention_forward
takes the attention mask to pad / unpad the inputs.
Thus you need the slicing, otherwise this operation fails; see the blame!
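To make that point concrete, here is a minimal, hypothetical sketch (not the transformers implementation; `unpad_like_fa2` and all shapes are made up) of why the mask must be sliced together with key/value states that have been truncated to the sliding window:

```python
# Minimal, hypothetical sketch: FA2's varlen path keeps only the positions
# where the padding mask is 1, so the mask must cover exactly the same number
# of key/value positions as the (possibly truncated) states.
import torch

def unpad_like_fa2(states, mask_2d):
    # Gather the non-padded positions, as the unpad step does conceptually.
    indices = torch.nonzero(mask_2d.flatten(), as_tuple=False).flatten()
    return states.reshape(-1, *states.shape[2:])[indices]

sliding_window = 4
mask_2d = torch.ones(1, 6, dtype=torch.long)   # [bs, kv_len]: 6 cached tokens
key_states = torch.randn(1, 6, 8, 64)          # [bs, kv_len, heads, head_dim]

# The sliding-window cache only keeps the last `sliding_window` positions:
key_states = key_states[:, -sliding_window:]

# With the full 6-token mask, the gather above would index past the 4 remaining
# positions and fail; slicing the mask the same way keeps both views consistent.
mask_2d = mask_2d[:, -sliding_window:]
print(unpad_like_fa2(key_states, mask_2d).shape)  # torch.Size([4, 8, 64])
```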
Indeed, I was too fast on this one; the HybridCache behaves slightly differently than I remembered. There was still an issue in the slicing during prefill for FA2, though!
What does this PR do?
As per the title. Models with HybridCache need to correctly slice the key/value states when using FA2, as the inputs need to be unpadded on the right as well (and the mask has shape [bs, seq_len]). It is currently broken and leads to garbage generation when using padding. This fixes it.
Also, do not slice the mask during prefill if it is longer than the sliding window (I added a test for this case); a sketch of the intended consistency follows.
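A minimal sketch, assuming a 2D padding mask of shape [bs, seq_len] and a sliding-window (hybrid) cache; `slice_for_fa2` and the shapes are hypothetical illustrations, not the code added by this PR:

```python
# Hypothetical illustration of keeping the 2D padding mask and the cached
# key/value states the same length for FA2 with a sliding-window cache.
import torch

def slice_for_fa2(key_states, value_states, attention_mask, sliding_window, is_prefill):
    # key_states / value_states: [bs, heads, kv_len, head_dim]
    # attention_mask:            [bs, kv_len] padding mask FA2 uses to unpad
    if is_prefill:
        # During prefill the mask describes the full (possibly padded) prompt,
        # even when it is longer than the sliding window, so it is left alone.
        return key_states, value_states, attention_mask
    # During decoding, sliding layers only attend to the last `sliding_window`
    # positions, so the states and the mask are cut to the same length to keep
    # FA2's right-side pad/unpad step consistent.
    key_states = key_states[:, :, -sliding_window:, :]
    value_states = value_states[:, :, -sliding_window:, :]
    attention_mask = attention_mask[:, -sliding_window:]
    return key_states, value_states, attention_mask

# Example usage with made-up sizes:
k = v = torch.randn(2, 8, 4096, 128)
m = torch.ones(2, 4096, dtype=torch.long)
k, v, m = slice_for_fa2(k, v, m, sliding_window=1024, is_prefill=False)
print(k.shape, m.shape)  # torch.Size([2, 8, 1024, 128]) torch.Size([2, 1024])
```

The design point is simply that whatever length the cached key/value states end up with, the mask FA2 unpads with has to be sliced to match, and only the decode path needs that truncation.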