Conversation

Tcc0403 (Collaborator) commented Nov 3, 2025

Summary

TODO (might be follow-up PRs):

  1. unit test
  2. autotune
  3. benchmark

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

Tcc0403 (Collaborator, Author) commented Nov 4, 2025

The new test file avoids importing from test.utils because torchvision is not compatible with pytorch>=2.9.0 and triton>=3.5.0 (required by helion), which leads to an ImportError.

Some test cases failed due to:

  1. NaN in gradients of lm_head.weight (weird shapes -> padding issue? maybe zero out the out-of-bounds grad_logit tile; see the sketch after the log below).
  2. Numerical error with dtype=torch.float -> needs higher tolerance.
python3 -m pytest test/transformers/helion/test_fused_linear_cross_entropy.py --log-cli-level="WARNING"
FAILED test/transformers/helion/test_fused_linear_cross_entropy.py::test_fused_linear_cross_entropy_correctness[dtype0-0.01-0.01-sum-2-1024-4096-32000] - AssertionError: Tensor-likes are not close!
FAILED test/transformers/helion/test_fused_linear_cross_entropy.py::test_fused_linear_cross_entropy_correctness[dtype0-0.01-0.01-sum-3-423-1000-10000] - AssertionError: lm_head.weight of liger contains nan
FAILED test/transformers/helion/test_fused_linear_cross_entropy.py::test_fused_linear_cross_entropy_correctness[dtype0-0.01-0.01-mean-3-423-1000-10000] - AssertionError: lm_head.weight of liger contains nan
FAILED test/transformers/helion/test_fused_linear_cross_entropy.py::test_fused_linear_cross_entropy_correctness[dtype1-0.001-0.01-sum-2-1024-4096-32000] - AssertionError: Tensor-likes are not close!
FAILED test/transformers/helion/test_fused_linear_cross_entropy.py::test_fused_linear_cross_entropy_correctness[dtype1-0.001-0.01-sum-3-423-1000-10000] - AssertionError: lm_head.weight of liger contains nan
FAILED test/transformers/helion/test_fused_linear_cross_entropy.py::test_fused_linear_cross_entropy_correctness[dtype1-0.001-0.01-mean-3-423-1000-10000] - AssertionError: lm_head.weight of liger contains nan
============================================================ 6 failed, 6 passed in 30.64s =============================================================
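A hedged sketch of the suspected fix for failure (1); the function, tile, and offset names are hypothetical. When V is not a multiple of the vocab block size, the trailing tile holds out-of-bounds columns whose garbage values can turn into NaN once they feed the dW accumulation, so zero them first:

```python
import torch

# Hypothetical host-side model of masking an out-of-bounds grad_logit tile.
# grad_logit_tile: (BLOCK_BT, BLOCK_V) tile of d(loss)/d(logits)
# v_offset: starting vocab column of this tile; V: true vocab size
def zero_oob_grad_logit_tile(grad_logit_tile: torch.Tensor, v_offset: int, V: int) -> torch.Tensor:
    cols = v_offset + torch.arange(grad_logit_tile.shape[1], device=grad_logit_tile.device)
    # Keep valid vocab columns, zero everything past V so dW stays NaN-free.
    return torch.where((cols < V).unsqueeze(0), grad_logit_tile, torch.zeros_like(grad_logit_tile))

# For failure (2), the usual remedy is a looser float32 tolerance, e.g.:
# torch.testing.assert_close(actual, expected, rtol=1e-2, atol=1e-2)
```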

Tcc0403 (Collaborator, Author) commented Nov 4, 2025

Autotuned with the following shapes on H100 SXM5:

batch_size = 2
seq_len = 2048
hidden_size = 4096
vocab_size = 32000
dtype = torch.float32
reduction = "mean"

Here's the result:

[8430s] Generation 20 complete: error=23 timeout=2 ok=96 min=56.2426 mid=76.8076 max=184.9636 best=Config(block_sizes=[32, 32, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', '', 'first', 'last', '', '', 'last', 'first'], num_stages=5, num_warps=4, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, True, False], range_num_stages=[0, 0, 0], range_unroll_factors=[0, 1, 1], range_warp_specializes=[])
[8430s] Autotuning complete in 8430.1s after searching 3711 configs.
One can hardcode the best config and skip autotuning with:
    @helion.kernel(config=helion.Config(block_sizes=[32, 32, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', '', 'first', 'last', '', '', 'last', 'first'], num_stages=5, num_warps=4, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, True, False], range_num_stages=[0, 0, 0], range_unroll_factors=[0, 1, 1], range_warp_specializes=[]), static_shapes=True)

A full autotune took more than 2 hours; the current implementation is extremely slow.
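For reference, this is how the printed config can be hardcoded to skip the search on later runs. The kernel name and signature below are hypothetical stand-ins; the config values are exactly the ones the autotuner printed above:

```python
import torch
import helion

# Sketch only: hardcode the best autotuned config so the kernel compiles
# directly instead of re-running the multi-hour search.
@helion.kernel(
    config=helion.Config(
        block_sizes=[32, 32, 256],
        indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor',
                  'pointer', 'pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor'],
        load_eviction_policies=['first', '', 'first', 'last', '', '', 'last', 'first'],
        num_stages=5,
        num_warps=4,
        pid_type='flat',
        range_flattens=[None, True, False],
        range_multi_buffers=[None, True, False],
        range_num_stages=[0, 0, 0],
        range_unroll_factors=[0, 1, 1],
        range_warp_specializes=[],
    ),
    static_shapes=True,
)
def fused_linear_cross_entropy_kernel(x: torch.Tensor, w: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    ...  # kernel body elided; name and signature are placeholders
```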

Tcc0403 (Collaborator, Author) commented Nov 4, 2025

Benchmarked with an 8x4096 input (BT=32768), hidden_size=4096, vocab_size=32000.

=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd      493.3454     0.06x          
torch_fwd            27.3347      1.00x (ref)    
=================================================================

=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd_bwd  506.0976     0.18x          
torch_fwd_bwd        92.5972      1.00x (ref)    
=================================================================

There are two hl.atomic_add() calls in the innermost loop for the two backprop matmuls. Replacing them with a lock-based add should improve performance a lot. Update: not really; atomic add is not the main bottleneck.
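For context, here are the reference semantics of the two backprop matmuls in plain PyTorch (shapes assumed: x is (BT, H), w is (V, H), grad_logits is (BT, V)). The Helion kernel computes these tile-by-tile, and because its inner loop does not run over the reduction dimension, each output tile is updated by many iterations, hence the hl.atomic_add calls:

```python
import torch

# Untiled reference of the two backward matmuls discussed above.
def flce_backward_reference(x: torch.Tensor, w: torch.Tensor, grad_logits: torch.Tensor):
    dx = grad_logits @ w      # (BT, H): reduces over V
    dw = grad_logits.T @ x    # (V, H):  reduces over BT
    return dx, dw
```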

Tcc0403 (Collaborator, Author) commented Nov 6, 2025

Llama3(H=4096, V=32000)

BT=4096

=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd      6.3785       0.54x          
torch_fwd            3.4434       1.00x (ref)    
cce_fwd              19.3814      0.18x          
triton_flce_fwd      17.1954      0.20x          
=================================================================
=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd_bwd  50.0497      0.22x          
torch_fwd_bwd        11.1041      1.00x (ref)    
cce_fwd_bwd          61.2685      0.18x          
triton_flce_fwd_bwd  17.8189      0.62x          
=================================================================

BT=32768

=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd      29.0221      1.02x          
torch_fwd            29.7404      1.00x (ref)    
cce_fwd              153.8144     0.19x          
triton_flce_fwd      80.8064      0.37x          
=================================================================
=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd_bwd  374.1972     0.22x          
torch_fwd_bwd        93.4857      0.88x          
cce_fwd_bwd          477.5481     0.17x          
triton_flce_fwd_bwd  82.4108      1.00x (ref)    
=================================================================

Qwen3(H=4096, V=151936)

BT=32768

=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd      71.3643      1.02x          
torch_fwd            73.1353      1.00x (ref)    
cce_fwd              413.2089     0.18x          
triton_flce_fwd      335.1601     0.22x          
=================================================================
=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd_bwd  882.6362     0.27x          
torch_fwd_bwd        234.5237     1.00x (ref)    
cce_fwd_bwd          775.0858     0.30x          
triton_flce_fwd_bwd  338.7038     0.69x          
=================================================================

Gemma3(H=2304, V=262208)

BT=8192

=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd      34.3520      1.10x          
torch_fwd            37.8505      1.00x (ref)    
cce_fwd              348.1931     0.11x          
triton_flce_fwd      338.0345     0.11x          
=================================================================
=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_flce_fwd_bwd  454.7143     0.27x          
torch_fwd_bwd        120.9745     1.00x (ref)    
cce_fwd_bwd          510.3440     0.24x          
triton_flce_fwd_bwd  340.8792     0.35x          
=================================================================

Tcc0403 (Collaborator, Author) commented Nov 6, 2025

This Helion implementation never materializes any logits in device memory. The forward pass works well, but the backward pass suffers in the dw and dx matmuls. Both are quite inefficient because the inner loops do not run over the reduction dimension, which means we have to perform an atomic_add or lock-based add on each iteration. I'll write a version with partially materialized logits (similar to the current Liger impl) that can perform more efficient matmuls for dw and dx; a sketch of that chunked approach follows.
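A minimal sketch of the planned chunked backward, assuming mean reduction; the function name, chunk size, and shapes are hypothetical. Logits are only ever materialized one BT-chunk at a time, and dx/dw use ordinary matmuls whose inner loop is the reduction dimension, so no atomic adds are needed:

```python
import torch

# x: (BT, H), w: (V, H), target: (BT,) class indices
def flce_backward_chunked(x: torch.Tensor, w: torch.Tensor, target: torch.Tensor, chunk: int = 4096):
    BT = x.shape[0]
    dx = torch.empty_like(x)
    dw = torch.zeros_like(w)
    for s in range(0, BT, chunk):
        xc, tc = x[s:s + chunk], target[s:s + chunk]
        logits = xc @ w.T                                  # recompute chunk logits
        p = torch.softmax(logits.float(), dim=-1)          # (chunk, V)
        p[torch.arange(tc.numel(), device=x.device), tc] -= 1.0
        p = (p / BT).to(x.dtype)                           # d(loss)/d(logits), mean reduction
        dx[s:s + chunk] = p @ w                            # reduces over V
        dw += p.T @ xc                                     # reduces over the chunk
    return dx, dw
```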

Tcc0403 (Collaborator, Author) commented Nov 8, 2025

Llama(H=4096, V=32000, dtype=fp32)

BT=32768

Chunked backward (recompute chunk logits, then accumulate dx and dw) works fine.

=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_fwd           52.8671      0.89x          
torch_fwd            47.1587      1.00x (ref)    
cce_fwd              191.8264     0.25x          
triton_flce_fwd      131.3960     0.36x          
=================================================================
=================================================================
Benchmark Results
=================================================================
Implementation       Time (ms)    Speedup        
-----------------------------------------------------------------
helion_fwd_bwd_chunk 221.1143     0.64x          
torch_fwd_bwd        147.3726     0.96x          
cce_fwd_bwd          648.3762     0.22x          
triton_flce_fwd_bwd  141.1500     1.00x (ref)    
=================================================================

Tcc0403 (Collaborator, Author) commented Nov 11, 2025

Some benchmark results

There seems to be a constant overhead when running LigerLMHeadCEHelion(H=H, V=V, dtype=dtype, grad_in_forward=True). Need to fix it.
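A small timing sketch for isolating that overhead; `run_op` is a placeholder for the actual module call. If the gap versus the baseline stays roughly constant as BT grows, the cost is per-call (host-side/launch) overhead rather than the kernel itself:

```python
import torch

# Average per-call latency measured with CUDA events.
def time_op(run_op, iters: int = 20, warmup: int = 5) -> float:
    for _ in range(warmup):
        run_op()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        run_op()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```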

speed

[benchmark charts: full, forward, backward]

memory

[memory chart: full]

Tcc0403 (Collaborator, Author) commented Nov 11, 2025

Fusing the logits computation and the softmax doesn't seem like a good idea: it leaves less parallelism for small shapes. I will make another version closer to the current Liger impl.
