
Conversation

garrett361
Contributor

@garrett361 garrett361 commented Jan 28, 2025

What does this PR do?

#33932 introduced FlashAttentionKwargs as an alternative to using position_ids for padding-free training. However, the RoPE positional embeddings are not currently applied correctly in the FlashAttentionKwargs code path. This PR ensures that RoPE is applied properly for this path.

Code Notes

The Issue

The issue is that if position_ids is not provided, they are internally generated here:

if position_ids is None:
position_ids = cache_position.unsqueeze(0)

and these are used to generate the rope embeddings here:

# create position embeddings to be shared across the decoder layers
position_embeddings = self.rotary_emb(hidden_states, position_ids)

These RoPE embeddings are then effectively a torch.arange over the packed length, whereas they should be derived non-trivially from the cumulative sequence lengths in FlashAttentionKwargs.
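
For concreteness, with two packed sequences of lengths 3 and 2 the difference looks like this (illustrative values only, not code from this PR):

# position_ids generated internally from cache_position: a plain arange over
# the packed length.
arange_position_ids = [[0, 1, 2, 3, 4]]

# What padding-free training needs, given cu_seq_lens = [0, 3, 5]: positions
# restart at every sequence boundary.
padding_free_position_ids = [[0, 1, 2, 0, 1]]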

The Fix

Introduce a get_position_ids_from_cu_seq_lens helper which converts from FlashAttentionKwargs to position_ids when the former are provided.
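
A minimal sketch of the conversion, assuming cu_seq_lens in the usual cumulative format (the helper in this PR may differ in its details):

import torch

def get_position_ids_from_cu_seq_lens(cu_seq_lens: torch.Tensor) -> torch.Tensor:
    # cu_seq_lens holds cumulative sequence lengths, e.g. [0, 3, 5] for two
    # packed sequences of lengths 3 and 2.
    seq_lens = cu_seq_lens[1:] - cu_seq_lens[:-1]
    # Positions restart at 0 for each packed sequence: [0, 1, 2, 0, 1].
    position_ids = torch.cat(
        [torch.arange(int(n), device=cu_seq_lens.device) for n in seq_lens]
    )
    return position_ids.unsqueeze(0)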

Because many other models inherit from LlamaDecoder, this change propagates to many other models via modular_model_converter.py.

Tests

The solution is tested in LlamaModelTest::test_attn_mask_position_ids_flash_attn_equality, which checks that the logits in the following cases are consistent with each other (a sketch of the three input formats is given below):

  • No padding-free, just padding and attention masks
  • Padding free via position_ids
  • Padding free via FlashAttentionKwargs

This test fails on latest main without the above fix.
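
For illustration, the three input formats for two sequences of lengths 3 and 2 could be built roughly as follows (a sketch with made-up token ids and the FlashAttentionKwargs key names as I understand them; the test's actual setup may differ):

import torch

seqs = [torch.tensor([10, 11, 12]), torch.tensor([20, 21])]

# 1) Padded batch plus attention mask.
input_ids = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)
attention_mask = torch.nn.utils.rnn.pad_sequence(
    [torch.ones_like(s) for s in seqs], batch_first=True
)

# 2) Padding-free via position_ids: one packed row, positions restart per sequence.
packed_ids = torch.cat(seqs).unsqueeze(0)
position_ids = torch.cat([torch.arange(len(s)) for s in seqs]).unsqueeze(0)

# 3) Padding-free via FlashAttentionKwargs: cumulative sequence lengths.
cu_seq_lens = torch.tensor([0, 3, 5], dtype=torch.int32)
flash_attention_kwargs = dict(
    cu_seq_lens_q=cu_seq_lens,
    cu_seq_lens_k=cu_seq_lens,
    max_length_q=3,
    max_length_k=3,
)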

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@garrett361 garrett361 marked this pull request as draft January 28, 2025 17:35
@Rocketknight1
Member

cc @Abhishek-TAMU @ArthurZucker because this is an update to #33932

@garrett361 garrett361 marked this pull request as ready for review February 3, 2025 15:47
@garrett361
Contributor Author

@Abhishek-TAMU @ArthurZucker I removed the draft status and this work is ready for review. Please let me know if I can answer any questions about this PR. Thank you!

@Rocketknight1
Member

cc @Cyrilvallez as well actually, since I think this relates to RoPE code you touched recently

@Cyrilvallez
Member

Hey @garrett361! Very nice catch, this is indeed quite important! Here are a few thoughts/guidelines:

First, as the function is a helper directly related to FA2, it should be moved to modeling_flash_attention_utils.py.
Second, may I know a bit more about the setting in which you use this? I kind of feel that when using packed tensor format, it should be the responsibility of the user to feed correct inputs to the model. It would avoid one more code path in our modeling.
Depending on where/how you use this, I suspect you could call the function upstream and then feed the correct position_ids to the model.
Let me know what you think!

@garrett361 garrett361 mentioned this pull request Feb 6, 2025
@garrett361
Contributor Author

Very nice catch, this is indeed quite important!

Thanks!

First, as the function is a helper directly related to FA2, it should be moved to modeling_flash_attention_utils.py.

Makes sense to me.

Second, may I know a bit more about the setting in which you use this?

I hit this in the process of testing #35861, which makes related changes for padding-free training. See there for further discussion as well.

I kind of feel that when using packed tensor format, it should be the responsibility of the user to feed correct inputs to the model. It would avoid one more code path in our modeling.

Agree it would be best if all of the needed data (both position_ids and cu_seq_len_x, in the present case) were computed and provided at the outset by, say, the dataloader. But there seem to be a bunch of different code paths available at the moment, and I am concerned about silently incorrect results if we don't have these kinds of helpers in the modeling code.

Examples:

  1. HF's DataCollatorWithFlattening only returns position_ids (see the sketch after this list for deriving cu_seq_lens from these)
  2. trl's DataCollatorForCompletionOnlyLM does what we want and returns all of {position_ids, cu_seq_lens_q, ...}.
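
For reference, deriving cumulative sequence lengths from flattened position_ids could look roughly like this (a sketch; not code from this PR or from either collator):

import torch

def cu_seq_lens_from_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    # position_ids: shape (1, total_tokens), restarting at 0 at each sequence
    # boundary, e.g. [[0, 1, 2, 0, 1]] for sequences of lengths 3 and 2.
    pos = position_ids.flatten()
    starts = torch.nonzero(pos == 0).flatten()
    lengths = torch.diff(starts, append=torch.tensor([pos.numel()], device=pos.device))
    # Prepend a leading zero to get cumulative lengths: [0, 3, 5].
    return torch.nn.functional.pad(lengths.cumsum(0), (1, 0)).to(torch.int32)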

What do you suggest for next steps?

@garrett361
Contributor Author

First, as the function is a helper directly related to FA2, it should be moved to modeling_flash_attention_utils.py.

@Cyrilvallez @Rocketknight1 I moved the helper as requested.

Please let me know if any more is needed from my end!

@ArthurZucker
Collaborator

Hey! Sorry, I agree with @Cyrilvallez and I think we should rather update/fix our data collator to make sure it passes the position ids and cu seqlens. We really don't want to add code that is specific to one integration path!

@ArthurZucker
Collaborator

🤗

@garrett361
Contributor Author

garrett361 commented Feb 14, 2025

Ok cool @ArthurZucker, so close this PR and adjust DataCollatorWithFlattening so that it returns {position_ids, cu_seq_lens_q, ...}?

@garrett361
Contributor Author

Closing this: the intended padding-free code path with FlashAttentionKwargs is that both the FlashAttentionKwargs and position_ids are provided to the model.

I plan to open a separate PR which sanity checks this and raises a ValueError if only FlashAttentionKwargs are provided, along with making the FlashAttentionKwargs explicit, properly typed args.
