
@Electronic-Waste (Member)

What this PR does / why we need it:

This PR fixes the missing dependency in the torchtune trainer image by installing build-essential.
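
For context, the change amounts to installing the compiler toolchain in the trainer image. A minimal sketch of such a Dockerfile change, assuming a Debian/Ubuntu-based image (the base image and layout below are placeholders for illustration, not the repo's actual Dockerfile):

    # Placeholder base image; the real torchtune trainer image differs.
    FROM python:3.11-slim

    # build-essential provides gcc/g++ and make, so Python packages that
    # compile C/C++ extensions at install time can build successfully.
    RUN apt-get update \
        && apt-get install -y --no-install-recommends build-essential \
        && rm -rf /var/lib/apt/lists/*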

After fixing this error, the torchtune trainer works as expected on my development cluster:

Setting manual seed to local seed 1978145440. Local seed is seed + rank = 1978145440 + 0
Model is initialized with precision torch.bfloat16.
Memory stats after model init:
        GPU peak memory allocation: 2.33 GiB
        GPU peak memory reserved: 2.34 GiB
        GPU peak memory active: 2.33 GiB
Tokenizer is initialized from file.
In-backward optimizers are set up.
Loss is initialized.
Writing logs to /workspace/output/logs/log_1760171624.txt
Generating train split: 52002 examples [00:00, 243423.09 examples/s]
No learning rate scheduler configured. Using constant learning rate.
 Profiling disabled.
 Profiler config after instantiation: {'enabled': False}
1|674|Loss: 1.6298757791519165:   5%|▌      %

/cc @kubeflow/kubeflow-trainer-team

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2884

Checklist:

  • Docs included if any changes are user facing


@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-trainer-team.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to the /cc in the PR description above:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste (Member, Author)

Could you please add the ok-to-test-gpu-runner label to this PR? @andreyvelich

/assign @kubeflow/kubeflow-trainer-team

@coveralls

Pull Request Test Coverage Report for Build 18427110343

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 54.488%

Totals:

  • Change from base Build 18375028578: 0.0%
  • Covered Lines: 1214
  • Relevant Lines: 2228
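
As a quick cross-check of the report's arithmetic: 1214 covered lines out of 2228 relevant lines is 1214 / 2228 ≈ 54.488%, matching the overall coverage figure above.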

💛 - Coveralls

@andreyvelich (Member) left a comment


Looks to be working now, thank you @Electronic-Waste!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit 3ac1307 into kubeflow:master on Oct 13, 2025
32 of 34 checks passed
google-oss-prow bot added this to the v2.1 milestone on Oct 13, 2025
Electronic-Waste deleted the fix/torchtune-c-compiler branch on October 14, 2025 at 03:05

Successfully merging this pull request may close these issues:

  • Missing C Compiler in TorchTune Trainer