
Enable FlashInfer support for encoder models and add head_dim padding workaround #6230


Merged
24 commits merged into sgl-project:main on Jul 20, 2025

Conversation

ccs96307
Contributor

Motivation

This PR aims to enhance the FlashInfer attention backend in SGLang to address two primary goals:

  1. Enable support for encoder-only models: Currently, the FlashInfer backend needs adjustments to correctly handle non-causal attention required by encoder architectures.
  2. Resolve an "Invalid configuration" error for specific head dimensions: When using encoder models with certain head dimensions (e.g., head_dim=32 as found in Supabase/gte-small and potentially other BGE-like models) with FlashInfer's ragged prefill operations, an internal error is triggered, preventing these models from running.

The original issue is: #6050

Modifications

This PR introduces the following key changes:

1. Encoder Model Support (Non-Causal Attention):

  • In FlashInferAttnBackend.forward_extend, the causal flag is now determined dynamically: for layers with layer.attn_type == AttentionType.ENCODER_ONLY, causal is set to False to enable bidirectional (non-causal) attention (see the sketch below).
  • For encoder self-attention, save_kv_cache is also set to False, since there is no autoregressive decode phase that would reuse the cached keys and values the way a decoder does.
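
A minimal sketch of this selection logic, assuming SGLang's AttentionType enum and the layer.attn_type attribute (the import path and helper name are illustrative, and forward_extend's surrounding arguments are omitted):

```python
# Sketch only: the real change lives inside FlashInferAttnBackend.forward_extend.
from sglang.srt.layers.radix_attention import AttentionType  # assumed import path


def _select_attention_flags(layer):
    """Return (causal, save_kv_cache) for a given attention layer."""
    if getattr(layer, "attn_type", None) == AttentionType.ENCODER_ONLY:
        # Encoder self-attention is bidirectional and has no autoregressive
        # decode phase, so nothing is written to the KV cache.
        return False, False
    # Decoder layers keep causal masking and KV caching.
    return True, True
```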

2. Workaround for FlashInfer head_dim Limitation (e.g., for head_dim=32):
FlashInfer currently fails when using BatchPrefillWithRaggedKVCacheWrapper with head_dim < 64 (e.g., 32). To work around this, we pad the head dimension up to 64 during prefill and forward steps:

  • A global variable global_fake_head_dim (default: 64) controls the padded size.
  • During prefill:
    • If the model’s head_dim is less than global_fake_head_dim, we use the padded fake_head_dim for planning (begin_forward), but keep sm_scale based on the original head_dim for correctness.
  • During forward:
    • Q, K, and V tensors are padded along the head dimension.
    • sm_scale remains based on the original head_dim.
    • FlashInfer returns output with the padded size, which we truncate back to the original shape.

This workaround is temporary until native support for head_dim < 64 is available in FlashInfer.
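
A minimal sketch of the padding path, assuming (num_tokens, num_heads, head_dim) Q/K/V layouts; the helper name is hypothetical and the wrapper call is abbreviated, since the exact BatchPrefillWithRaggedKVCacheWrapper signature varies across FlashInfer versions:

```python
import torch
import torch.nn.functional as F

global_fake_head_dim = 64  # padded head_dim used when the real head_dim is smaller


def padded_ragged_prefill(wrapper, q, k, v, head_dim):
    """Run ragged prefill with the head dimension padded to global_fake_head_dim."""
    # The softmax scale must reflect the *original* head_dim, not the padded one.
    sm_scale = 1.0 / (head_dim ** 0.5)

    if head_dim < global_fake_head_dim:
        pad = global_fake_head_dim - head_dim
        # Zero-padding the last dimension is safe: zero K columns add nothing to
        # QK^T, and zero V columns only produce output columns we truncate below.
        q = F.pad(q, (0, pad))
        k = F.pad(k, (0, pad))
        v = F.pad(v, (0, pad))

    out = wrapper.forward(q, k, v, sm_scale=sm_scale)  # call abbreviated
    return out[..., :head_dim]  # drop the padded columns
```

The wrapper is assumed to have been planned (begin_forward) with the padded head dimension, as described above.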

3. Verification and Results:
The effectiveness of these changes, particularly the padding workaround for gte-small (or a similar model with head_dim=32), was verified by comparing the FlashInfer backend's output (final embedding logits, e.g., shape (10000, 768)) against Triton and a native PyTorch attention implementation (torch_native).

Numerical Similarity (vs torch_native for gte-small like model):

  • torch.allclose (rtol=0.01, atol=0.001):
    • FlashInfer: True
    • Triton: True
  • torch.allclose (rtol=0.001, atol=0.0001):
    • FlashInfer: False
    • Triton: False
  • Mean Absolute Error (MAE):
    • FlashInfer: 1.89077000e-05
    • Triton: 1.78243699e-05
  • Maximum Absolute Error:
    • FlashInfer: 9.76562500e-04
    • Triton: 9.76562500e-04

These results show that the padded FlashInfer backend achieves an MAE of about 1.9e-5 against the native PyTorch reference, essentially matching Triton (about 1.8e-5). The slightly larger maximum error and the failure at the tighter allclose tolerances are common for optimized kernels, especially with float16/bfloat16 dtypes, and are considered within acceptable limits.
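
For reference, a comparison of this kind can be reproduced roughly as follows; emb_flashinfer, emb_triton, and emb_ref are hypothetical (10000, 768) embedding tensors collected from the same prompts on each backend:

```python
import torch


def compare(out: torch.Tensor, ref: torch.Tensor) -> dict:
    """Report the same metrics as the table above for two embedding tensors."""
    diff = (out.float() - ref.float()).abs()
    return {
        "allclose(rtol=0.01, atol=0.001)": torch.allclose(out, ref, rtol=0.01, atol=0.001),
        "allclose(rtol=0.001, atol=0.0001)": torch.allclose(out, ref, rtol=0.001, atol=0.0001),
        "MAE": diff.mean().item(),
        "max_abs_err": diff.max().item(),
    }


# print(compare(emb_flashinfer, emb_ref))
# print(compare(emb_triton, emb_ref))
```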

Performance (seconds / 10,000 requests, for gte-small like model):

  • FlashInfer (padded): 39.551 seconds
  • Triton: 39.144 seconds
  • Torch Native: 46.192 seconds

The padded FlashInfer backend demonstrates performance comparable to Triton and significantly improves over the native PyTorch implementation.


I'm open to discussing whether the current solution is appropriate. It might be better to remove the temporary workaround and retain only the causal check, especially if full FlashInfer support is expected soon.

That said, I'm also happy to keep the workaround in place while we wait for FlashInfer support to land.
Thank you for taking the time to review this -- I'm open to any suggestions.


@Fridge003 Fridge003 self-assigned this May 12, 2025
@ccs96307
Contributor Author

Hi,

I noticed some tests failed. There seem to be a couple of issues:

  1. One error is Error: fatal: remote error: upload-pack: not our ref d9e280a70f7be9b97bc7ba2fcd3bc17c2dbf23cc. I've recently updated this PR branch by merging the latest changes from main (the current head of this PR is d86966e), so this checkout error might have occurred if the test was running on a previous, now-stale reference.

  2. Another issue is ValueError: Unrecognized model in neuralmagic/Qwen2-7B-Instruct-FP8.... I suspect this might be unrelated to the changes in my PR.

Could you kindly re-run the CI workflow on the latest commit (d86966e) of this PR when you get a chance?

Thanks for your help!

@Fridge003
Collaborator

Fridge003 commented May 14, 2025

Thanks for your contribution! But I still have some confusion.

In the benchmark results, the performance of the FlashInfer backend is similar to the Triton backend (even a little slower). FlashInfer is usually significantly faster than Triton, so I suspect this is due to the padding, which wastes a lot of computational resources.

A better way might be to raise an issue in the flashinfer repo and push them to implement head_dim=32. Otherwise I don't see a reason to use the FlashInfer backend for encoder models instead of Triton, since Triton is better in both flexibility and performance here. The padding adds code complexity and makes the code harder for us to maintain.

@ccs96307
Contributor Author

Thanks for your review, @Fridge003. You're right, the head_dim padding workaround adds complexity without a clear performance win over Triton in this scenario.

I'm happy to remove the padding. This PR will then focus on enabling non-causal attention for encoders (the causal flag logic), allowing FlashInfer to be used with encoder models that have natively supported head dimensions (in my testing, at least BGE-M3 works).

The FlashInfer head_dim limitation itself is tracked here: flashinfer-ai/flashinfer#1048.

If this revised approach is acceptable, I'll update the PR. Thanks!

@Fridge003
Collaborator


Thanks for your update~ You can remove the padding logic first and add a comment noting that we are waiting for an update from flashinfer. On the CI side, you can skip models with head_dim lower than 64 for the flashinfer backend.
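
Something along these lines would be enough for the skip (a pytest-style guard; the helper name and backend/head_dim arguments are illustrative, not SGLang's actual test utilities):

```python
import pytest

FLASHINFER_MIN_HEAD_DIM = 64  # ragged prefill currently requires head_dim >= 64


def maybe_skip_for_flashinfer(attention_backend: str, head_dim: int) -> None:
    """Skip encoder-model tests whose head_dim FlashInfer cannot handle yet."""
    if attention_backend == "flashinfer" and head_dim < FLASHINFER_MIN_HEAD_DIM:
        pytest.skip(
            f"head_dim={head_dim} < {FLASHINFER_MIN_HEAD_DIM} is not supported by "
            "FlashInfer's ragged prefill yet (flashinfer-ai/flashinfer#1048)"
        )
```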

@ccs96307
Contributor Author

ccs96307 commented May 15, 2025

Hi, I tried removing the padding workaround and re-ran my test (this time testing 100,000 requests on BGE-m3 with async POST):

  • flashinfer: 169 seconds / 100,000 requests (0.00169 seconds per request)
  • triton: 185 seconds / 100,000 requests (0.00185 seconds per request)
  • torch_native: 405 seconds / 100,000 requests (0.00405 seconds per request)
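
For context, these numbers come from firing the requests asynchronously; a rough sketch of that style of benchmark, assuming an OpenAI-compatible /v1/embeddings endpoint (URL, model name, and concurrency below are illustrative):

```python
import asyncio
import time

import aiohttp

URL = "http://127.0.0.1:30000/v1/embeddings"  # illustrative endpoint
PAYLOAD = {"model": "BAAI/bge-m3", "input": "hello world"}


async def one_request(session: aiohttp.ClientSession, sem: asyncio.Semaphore) -> None:
    async with sem:
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.json()


async def main(num_requests: int = 100_000, concurrency: int = 256) -> None:
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(session, sem) for _ in range(num_requests)))
        elapsed = time.perf_counter() - start
    print(f"{elapsed:.1f} s / {num_requests} requests "
          f"({elapsed / num_requests:.5f} s per request)")


if __name__ == "__main__":
    asyncio.run(main())
```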


@Fridge003 Fridge003 left a comment


LGTM

@Fridge003 Fridge003 requested a review from BBuf as a code owner May 17, 2025 19:16
@Fridge003 Fridge003 added the ready-to-merge The PR is ready to merge after the CI is green. label May 19, 2025
@Fridge003 Fridge003 removed the ready-to-merge The PR is ready to merge after the CI is green. label May 20, 2025
@ccs96307
Contributor Author

Hi @Fridge003 and team,

It seems the CI checks (amd_ci_exec.sh python3 test_eval_accuracy_large.py) failed.

After reviewing the logs, I found that the failure occurred in the test_human_eval benchmark. The model scored 0.639, just slightly below the required threshold of 0.64. The test was retried once but failed again with a similar score.

Given that my PR focuses on enabling FlashInfer for encoder models, and this test evaluates the general accuracy of a large decoder model (Llama-3.1-8B-Instruct) on an AMD/ROCm platform, I suspect this might be a flaky test and likely unrelated to my changes.

Could you please help confirm this or suggest how to proceed? Perhaps the CI job could be re-run?

Thanks for your help!

@Fridge003 Fridge003 added the ready-to-merge The PR is ready to merge after the CI is green. label Jul 12, 2025
@zhyncs zhyncs merged commit cbdfb77 into sgl-project:main Jul 20, 2025
22 of 60 checks passed