Unique name and logfile for SLURM-launched A100 runners #525

Open
@yhtang

This workflow run exposed an issue with our current workflow: both the JAX and Pallas unit tests call the _runner_ondemand_slurm.yaml workflow to create A100 runners. If two such calls happen in quick succession, they end up creating two runners with identical names (A100-${{ github_run_id }}) that the SLURM cluster may schedule at the same time, which prevents the actual job from landing properly on a runner (more details to be discovered here).

To avoid conflicts between runners launched this way, each runner needs a distinct name, e.g. one that includes a UUID.
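
As a rough illustration of the idea, here is a minimal Python sketch of deriving a unique runner name and a matching logfile name from the run ID plus a random UUID fragment. The helper `make_runner_identity`, the 8-character suffix length, and the `.log` naming are assumptions for illustration only; they are not taken from the existing workflow, which may be better served by doing the equivalent directly in the workflow or launch script.

```python
import uuid


def make_runner_identity(run_id: str, prefix: str = "A100") -> tuple[str, str]:
    """Return a (runner_name, logfile_name) pair that is unique per call.

    Hypothetical helper: the naming scheme below is an assumption,
    not the scheme used by _runner_ondemand_slurm.yaml.
    """
    # A short random suffix keeps two launches from the same workflow run
    # from colliding, while the run ID stays visible in the name.
    suffix = uuid.uuid4().hex[:8]
    runner_name = f"{prefix}-{run_id}-{suffix}"
    # Deriving the logfile from the runner name keeps logs separate per runner.
    logfile_name = f"{runner_name}.log"
    return runner_name, logfile_name


if __name__ == "__main__":
    name, logfile = make_runner_identity("1234567890")
    print(name)
    print(logfile)
```

With this scheme, two calls made from the same workflow run share the `A100-<run id>` prefix but differ in the suffix, so their runners and logfiles can still be traced back to the run that launched them.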
