
TE: Fix redundant compute for PEFT using transform #2138


Open · wants to merge 20 commits into main
Conversation

@kshitij12345 (Collaborator) commented on May 26, 2025

Fixes: #2076

TODO

  • Tested with RTX 6000; verify for sanity on H100 and B200.

The fix adds a pass that inspects the backward trace to determine whether wgrad and bgrad are actually computed, and updates the forward trace accordingly.
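For reference, these are the three backward products of a linear layer; skipping wgrad (and bgrad) for frozen parameters removes the redundant GEMMs and means the tensors that are only needed for wgrad (e.g. the FP8 copy of the input) no longer have to be saved for backward. A plain-PyTorch illustration, not TE code:

    import torch

    # Shapes for y = x @ W.T with x: [4, 8], W: [16, 8], grad_y: [4, 16]
    x = torch.randn(4, 8)
    W = torch.randn(16, 8)
    grad_y = torch.randn(4, 16)

    dgrad = grad_y @ W            # grad wrt the input: always needed to keep backpropagating
    wgrad = grad_y.T @ x          # grad wrt the weight: redundant when the weight is frozen
    bgrad = grad_y.sum(dim=0)     # grad wrt the bias: redundant when there is no trainable bias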

A test has been added for this (and the existing single-GPU and distributed tests have been verified).

NOTE:

  • Changes are only for the v1 executor; similar changes are needed for the v2 executor.

Example Program:

    import torch

    with torch.device("cuda"):
        model = torch.nn.Sequential(*(torch.nn.Linear(32, 32, bias=False) for _ in range(4)))
        x = torch.randn(32, 32, requires_grad=True)

    for idx, parameters in enumerate(model.parameters()):
        # Every even linear layer's weight is frozen (PEFT-style partial freezing).
        if idx % 2 == 0:
            parameters.requires_grad = False
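For completeness, a rough sketch of how the traces below can be obtained (not part of this PR): the executor import path and the trace-inspection calls are assumptions about the setup and may differ, and the FP8 recipe details are elided.

    import thunder
    # Assumed import path for the TransformerEngine executor; adjust if it differs.
    from thunder.executors.transformer_engineex import transformer_engine_ex

    jmodel = thunder.jit(model, executors=[transformer_engine_ex])
    out = jmodel(x)
    out.sum().backward()

    # Final forward and backward execution traces (as shown below).
    fwd_trace = thunder.last_traces(jmodel)[-1]
    bwd_trace = thunder.last_backward_traces(jmodel)[-1]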

Forward Trace

@transformer_engine.fp8_autocast(fp8_recipe=te_fp8_recipe)
@torch.no_grad()
@no_autocast
def computation(input, t_0_weight, t_1_weight, t_2_weight, t_3_weight):
  # input: "cuda:0 f32[32, 32]"
  # t_0_weight: "cuda:0 f32[32, 32]"
  # t_1_weight: "cuda:0 f32[32, 32]"
  # t_2_weight: "cuda:0 f32[32, 32]"
  # t_3_weight: "cuda:0 f32[32, 32]"

  # /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:125:             return F.linear(input, self.weight, self.bias)
  (t27, (t19, t20, t21, t22, t23, t24), ctx_te_1418) = te_linear_13(input, t_0_weight, None, input_requires_grad=True, weight_requires_grad=False, bias_requires_grad=False)
  (t42, (t34, t35, t36, t37, t38, t39, t40, t41), ctx_te_1533) = te_linear_14(t27, t_1_weight, None, input_requires_grad=True, weight_requires_grad=True, bias_requires_grad=False)
  del t27

  # /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:125:             return F.linear(input, self.weight, self.bias)
  (t57, (t49, t50, t51, t52, t53, t54), ctx_te_1648) = te_linear_15(t42, t_2_weight, None, input_requires_grad=True, weight_requires_grad=False, bias_requires_grad=False)
  del t42

  # /usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py:125:             return F.linear(input, self.weight, self.bias)
  (t72, (t64, t65, t66, t67, t68, t69, t70, t71), ctx_te_1763) = te_linear_16(t57, t_3_weight, None, input_requires_grad=True, weight_requires_grad=True, bias_requires_grad=False)
  del t57
  return {'output': (t72,), 'flat_args': [input, t_0_weight, t_1_weight, t_2_weight, t_3_weight], 'flat_output': (t72,)}, ((t19, t20, t21, t22, t23, t24, t34, t35, t36, t37, t38, t39, t40, t41, t49, t50, t51, t52, t53, t54, t64, t65, t66, t67, t68, t69, t70, t71), (ctx_te_1418, ctx_te_1533, ctx_te_1648, ctx_te_1763))

Backward Trace

def backward_fn(saved_for_backward, cotangents):
  # saved_for_backward: "Collection"
  # cotangents: "Collection"
  C0, C1, = saved_for_backward
  # C0: "Collection"
  # C1: "Collection"
  clear_mutable_collection(saved_for_backward)
  del saved_for_backward
  t73, = cotangents
  # t73: "cuda:0 f32[32, 32]"
  clear_mutable_collection(cotangents)
  del cotangents
  t19, t20, t21, t22, t23, t24, t34, t35, t36, t37, t38, t39, t40, t41, t49, t50, \
  t51, t52, t53, t54, t64, t65, t66, t67, t68, t69, t70, t71, = C0
  clear_mutable_collection(C0)
  del C0
  ctx_te_1418, ctx_te_1533, ctx_te_1648, ctx_te_1763, = C1
  clear_mutable_collection(C1)
  del C1
  (bw_t74, grad_for_t_3_weight, _) = te_functional_linear_backward((32, 32), (32, 32), None, ctx_te_1763, (t64, t65, t66, t67, t68, t69, t70, t71), t73, input_requires_grad=True, weight_requires_grad=True, bias_requires_grad=False)
  del ctx_te_1763, t64, t65, t66, t67, t68, t69, t70, t71, t73
  (bw_t59, _, _) = te_functional_linear_backward((32, 32), (32, 32), None, ctx_te_1648, (t49, t50, t51, t52, t53, t54), bw_t74, input_requires_grad=True, weight_requires_grad=False, bias_requires_grad=False)
  del ctx_te_1648, t49, t50, t51, t52, t53, t54, bw_t74
  (bw_t44, grad_for_t_1_weight, _) = te_functional_linear_backward((32, 32), (32, 32), None, ctx_te_1533, (t34, t35, t36, t37, t38, t39, t40, t41), bw_t59, input_requires_grad=True, weight_requires_grad=True, bias_requires_grad=False)
  del ctx_te_1533, t34, t35, t36, t37, t38, t39, t40, t41, bw_t59
  (grad_for_input, _, _) = te_functional_linear_backward((32, 32), (32, 32), None, ctx_te_1418, (t19, t20, t21, t22, t23, t24), bw_t44, input_requires_grad=True, weight_requires_grad=False, bias_requires_grad=False)
  del ctx_te_1418, t19, t20, t21, t22, t23, t24, bw_t44
  te_sync_fp8_meta_bwd()
  return (grad_for_input, None, grad_for_t_1_weight, None, grad_for_t_3_weight)
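Continuing the sketch after the example program, the gradients that land on the original parameters should mirror the backward trace's return value (a quick sanity check under the same assumptions as that sketch, e.g. that the jitted module shares parameters with model):

    # Mirrors: return (grad_for_input, None, grad_for_t_1_weight, None, grad_for_t_3_weight)
    assert x.grad is not None
    assert model[0].weight.grad is None and model[2].weight.grad is None          # frozen
    assert model[1].weight.grad is not None and model[3].weight.grad is not None  # trainable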

@kshitij12345 marked this pull request as draft on May 26, 2025, 13:25
@riccardofelluga (Collaborator) left a comment


This is a nice fix; however, we cannot rely on the assumption that requires_grad is always propagated throughout the trace, and I think we should move away from that assumption unless we can make sure the propagation is always guaranteed.

A more involved alternative would be to pick up on the runtime proxy idea.

Comment on lines 599 to 601
dgrad, wgrad, bgrad = bsym.output
w_requires_grad = True if wgrad is not None else False
b_requires_grad = True if bgrad is not None else False

Interesting hack for requires_grad propagation, though if the symbol before the one captured by the TE executor did not propagate requires_grad, this might not work as intended.

@kshitij12345 (Collaborator, Author)

> This is a nice fix; however, we cannot rely on the assumption that requires_grad is always propagated throughout the trace, and I think we should move away from that assumption unless we can make sure the propagation is always guaranteed.

This fix doesn't rely on requires_grad being propagated correctly, but on whether or not the gradient is returned from the backward trace.

# Update the backward trace to only compute gradients for the
# inputs that require gradients
assert bw_trace.bound_symbols[-1].sym.id == PrimIDs.RETURN
filtered_grads = tuple(
    (arg_grad if requires_grad else None)
    for arg_grad, requires_grad in utils.safe_zip(bw_trace.bound_symbols[-1].args[0], requires_grad_mask)
)
# autograd.Function.backward expects a flat tuple of gradients
bw_trace.bound_symbols[-1] = replace(bw_trace.bound_symbols[-1], args=(filtered_grads,))

If the gradient is not returned from the backward trace, we update both traces: the forward trace so that the FP8 copy is not saved for backward, and the backward trace so that wgrad is not computed.
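In other words, the only signal used is whether a (non-None) gradient for a given input is returned from the backward trace. A minimal, self-contained stand-in for that decision (plain Python, not the actual Thunder pass; names are illustrative):

    def requires_grad_from_bwd_return(flat_args, returned_grads):
        """True for an arg iff the backward trace actually returns a gradient for it."""
        return {name: grad is not None for name, grad in zip(flat_args, returned_grads)}

    # Applied to the example program's backward trace RETURN:
    flat_args = ("input", "t_0_weight", "t_1_weight", "t_2_weight", "t_3_weight")
    returned = ("grad_for_input", None, "grad_for_t_1_weight", None, "grad_for_t_3_weight")
    assert requires_grad_from_bwd_return(flat_args, returned) == {
        "input": True, "t_0_weight": False, "t_1_weight": True,
        "t_2_weight": False, "t_3_weight": True,
    }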

> A more involved alternative would be to pick up on the runtime proxy idea.

As far as I can tell, the RuntimeProxy idea (#1599) will just prevent us from fetching requires_grad from intermediate TensorProxys. However, it won't fix the problem of correctly propagating it (#1768). I could be wrong though, cc @IvanYashchuk as the author of #1599 to clarify.

@kshitij12345 marked this pull request as ready for review on June 4, 2025, 23:29
@kshitij12345 (Collaborator, Author)

TODO: Understand the interaction of this PR with #2102

@nvMelissa (Collaborator)

@kshitij12345 - this PR is ready, yes? Who needs to approve this please?

@kshitij12345 (Collaborator, Author)

Need to update this PR to work correctly with the changes from #2102.

Also, the TE integration is broken after #2102, so #2222 needs to be merged first, followed by this PR.

Moving it back to draft to avoid confusion.

@kshitij12345 marked this pull request as draft on June 18, 2025, 08:49
@kshitij12345 marked this pull request as ready for review on June 30, 2025, 15:42

Successfully merging this pull request may close this issue: TE: Redundant backward computation in PEFT setting.