@davidtweedle
Contributor

The problem affecting the PyTorch ogbg workloads (which only shows up after they have run for some length of time) seems to come from JAX/XLA CPU compilation of the metrics computation. Converting the JAX arrays to NumPy should hopefully avoid it. The next step is to test on schedule free and shampoo, which I hope to do very soon.

@github-actions

github-actions bot commented Jun 5, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@davidtweedle
Contributor Author

You can see here that schedule free now completes the run with the changes.
https://pastebin.com/RgqMqgkb

@davidtweedle
Contributor Author

Tested again, this time replacing only the jax sigmoid call with a numpy sigmoid. This also avoids the crash, so the problem seems to be calling jax sigmoid from the pytorch workload.
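For reference, a minimal sketch of what a NumPy-only sigmoid could look like. The name `sigmoid_np` follows the commit titles in this PR, but the body below is an assumption and not necessarily the code that was merged; it uses the identity sigmoid(x) = 0.5 * (1 + tanh(x / 2)), which avoids overflow for large negative inputs.

```python
import numpy as np


def sigmoid_np(x):
  """Logistic sigmoid computed with NumPy only (illustrative sketch)."""
  x = np.asarray(x, dtype=np.float64)
  # tanh saturates smoothly, so no exp() overflow warnings for large |x|.
  return 0.5 * (1.0 + np.tanh(0.5 * x))


# Logits as they might come out of the model, already converted to NumPy.
logits = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
probs = sigmoid_np(logits)
```

The plain form `1 / (1 + np.exp(-x))` would give the same values in the normal range but emits overflow warnings for very negative logits, which is why the tanh form is used in this sketch.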

The problem with torchrun and jax seems to be caused by jax.nn.sigmoid.
Changed from lambda expression which pylint doesn't like.
Defined np sigmoid inside use_pytorch_ddp
Added white space before and after sigmoid_np
Fix white space
@priyakasimbeg priyakasimbeg changed the base branch from dev to ogbg_fix June 26, 2025 19:00
@priyakasimbeg priyakasimbeg changed the title [WIP] Update metrics.py - fix for ogbg pytorch Update metrics.py - fix for ogbg pytorch Jun 26, 2025
@priyakasimbeg priyakasimbeg changed the base branch from ogbg_fix to dev June 26, 2025 19:03
@priyakasimbeg priyakasimbeg marked this pull request as ready for review June 26, 2025 19:05
@priyakasimbeg priyakasimbeg requested a review from a team as a code owner June 26, 2025 19:05
@priyakasimbeg priyakasimbeg self-requested a review June 26, 2025 19:06
@priyakasimbeg priyakasimbeg merged commit 2f3c23c into mlcommons:dev Jun 26, 2025
16 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jun 26, 2025
