Slurm: broken (hangs on connecting to worker 1 out of <N>) #69

Open · MilesCranmer opened this issue Feb 10, 2024 · 12 comments
Labels: bug (Something isn't working), manager: SLURM

Comments

MilesCranmer commented Feb 10, 2024

Is anybody maintaining this package? I haven't been able to get Slurm working for the past month or so... It just ends up stalling on connecting to worker 1 out of <N>:

julia> p = addprocs_slurm(2)
connecting to worker 1 out of <N>

The exact same code seemed to work a month ago. This is Slurm 22.05.8; I'm not sure whether this new version is what's breaking things.
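One quick check (a hypothetical diagnostic, not from the original report) is whether Slurm can launch plain processes at all, independently of ClusterManagers.jl:

julia> # if this also hangs, the problem is in the Slurm setup itself, not the manager
julia> run(`srun -n 2 hostname`)

If srun returns promptly, the hang is more likely in the worker handshake that addprocs_slurm performs after the workers start.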

MilesCranmer added the bug label Feb 10, 2024

Moelf commented Feb 11, 2024

I don't have access to a Slurm cluster right now, but it would be useful to know whether a previous version was okay.

MilesCranmer (Author)

I'm not sure how I would test other versions of Slurm... I'm stuck with whatever my institute's cluster has installed.


Moelf commented Feb 11, 2024

I guess in this case, ask the HPC admin and see if they know anything that might be causing the problem.


kescobo commented Feb 11, 2024

Ugh, @MilesCranmer, that's annoying. I also don't currently have access to a SLURM cluster... this is the kind of thing where it would be nice to have JuliaParallel/ClusterManagers.jl#105, so we could test on different schedulers 🤦

It would definitely be worth checking with the cluster admin to see whether SLURM was recently updated, so we can at least confirm whether that's the culprit.

MilesCranmer (Author)

We could, if this PR gets finished: JuliaParallel/ClusterManagers.jl#193

cnrrobertson

@kescobo I can confirm that this issue started for me after an upgrade to Slurm on my institution's cluster. Unfortunately, I don't know what the previous version was, but currently the version is 23.11.1


kescobo commented Mar 5, 2024

From the initial post, it looks like it goes back to v22

> This is slurm 22.05.8

Does anyone know if SLURM follows SemVer?


jewh commented Nov 28, 2024

I'm getting the same error with the same behaviour: code that previously worked fine is now broken. The Slurm version is 20.11.7, and the cluster admin confirms there has been no upgrade over the past year.

DilumAluthge (Member)

@MilesCranmer Does the SlurmClusterManager.jl package work for you?

DilumAluthge changed the title from "Slurm broken" to "Slurm: broken (hangs on connecting to worker 1 out of <N>)" Jan 2, 2025
DilumAluthge (Member)

Bump @MilesCranmer: I just wanted to check whether the SlurmClusterManager.jl package works for you?

If so, I think we can add a note to the README recommending that users use SlurmClusterManager.jl, and then we can close this issue.
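For reference, the basic SlurmClusterManager.jl usage pattern (a minimal sketch based on that package's documented workflow, which assumes the script runs inside an existing Slurm allocation, e.g. under sbatch or salloc) looks like:

julia> using Distributed, SlurmClusterManager
julia> addprocs(SlurmManager())  # one worker per task in the current allocation

Unlike addprocs_slurm(n), SlurmManager does not request a new allocation; it reads the worker count from the surrounding Slurm job.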

MilesCranmer (Author)

Yes, thanks (I think once the project activation stuff merges, it should be good to go)

DilumAluthge transferred this issue from JuliaParallel/ClusterManagers.jl Feb 15, 2025
DilumAluthge (Member)

@MilesCranmer Just closing the loop here: can you confirm that your issue has been resolved by using the latest release (v1.0.0) of the SlurmClusterManager.jl package?
