
@Electronic-Waste (Member)

What this PR does / why we need it:

This PR fixes the missing dependency in the torchtune trainer image by installing build-essential.
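
For context, the change amounts to installing the compiler toolchain in the trainer image. A minimal sketch of such a Dockerfile change, assuming a Debian/Ubuntu-based image (the base image and layout below are placeholders for illustration, not the repo's actual Dockerfile):

    # Placeholder base image; the real torchtune trainer image differs.
    FROM python:3.11-slim

    # build-essential provides gcc/g++ and make, so Python packages that
    # compile C/C++ extensions at install time can build successfully.
    RUN apt-get update \
        && apt-get install -y --no-install-recommends build-essential \
        && rm -rf /var/lib/apt/lists/*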

After fixing this error, the torchtune trainer works as expected on my development cluster:

Setting manual seed to local seed 1978145440. Local seed is seed + rank = 1978145440 + 0
Model is initialized with precision torch.bfloat16.
Memory stats after model init:
        GPU peak memory allocation: 2.33 GiB
        GPU peak memory reserved: 2.34 GiB
        GPU peak memory active: 2.33 GiB
Tokenizer is initialized from file.
In-backward optimizers are set up.
Loss is initialized.
Writing logs to /workspace/output/logs/log_1760171624.txt
Generating train split: 52002 examples [00:00, 243423.09 examples/s]
No learning rate scheduler configured. Using constant learning rate.
 Profiling disabled.
 Profiler config after instantiation: {'enabled': False}
1|674|Loss: 1.6298757791519165:   5%|▌      %

/cc @kubeflow/kubeflow-trainer-team

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2884

Checklist:

  • Docs included if any changes are user facing


@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-trainer-team.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to the /cc in the PR description above:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste (Member, Author)

Could you please add the ok-to-test-gpu-runner label to this PR? @andreyvelich

/assign @kubeflow/kubeflow-trainer-team

@coveralls

Pull Request Test Coverage Report for Build 18427110343

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 54.488%

Totals:

  • Change from base Build 18375028578: 0.0%
  • Covered Lines: 1214
  • Relevant Lines: 2228
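
As a quick cross-check of the report's arithmetic: 1214 covered lines out of 2228 relevant lines is 1214 / 2228 ≈ 54.488%, matching the overall coverage figure above.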

💛 - Coveralls

@andreyvelich (Member) left a comment


Looks to be working now, thank you @Electronic-Waste!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit 3ac1307 into kubeflow:master on Oct 13, 2025
32 of 34 checks passed
google-oss-prow bot added this to the v2.1 milestone on Oct 13, 2025
Electronic-Waste deleted the fix/torchtune-c-compiler branch on October 14, 2025 at 03:05

Successfully merging this pull request may close these issues:

  • Missing C Compiler in TorchTune Trainer