
Debugging utility for VSCode and SLURM clusters #240

@frazane


Is your feature request related to a problem? Please describe.

A common complaint is that, while it is relatively easy to use the VSCode debugger from a login node on an HPC cluster, doing so from a compute node can be quite tricky. There are ways to do it, though. One possibility is to start an interactive job on a compute node and open a VSCode server on it via remote SSH. However, this is often not possible or not allowed, so other workarounds are needed; see for example https://github.com/KaijingOfficial/Vscode-Debug-Python-on-Slurm (credit to @dnerini for finding and testing it).

It would be great if we could integrate this into anemoi-utils. Thoughts?

Describe the solution you'd like

Such debugging functionality could be brought to the entire anemoi framework by leveraging the centralized CLI abstractions provided by anemoi-utils. For example, we could inject the debugpy code somewhere in the Command base class or in cli_main and activate it behind a --debugpy flag or similar. Further below I provide a small example of how the debugpy initialization can be injected before running any program.
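As a rough illustration of the flag idea only (the full wrapper script follows after this), here is a minimal sketch of what a `--debugpy` option could look like on an argparse-based CLI. Everything named here is hypothetical: `add_debug_arguments`, `maybe_start_debugger`, `cli_main_sketch` and the `--debugpy-port` option are made up for the example and are not the actual anemoi-utils API. The only real library calls are `debugpy.listen` and `debugpy.wait_for_client`.

```python
import argparse
import os
import socket


def add_debug_arguments(parser):
    """Register the (hypothetical) debug flags on an existing parser."""
    parser.add_argument("--debugpy", action="store_true",
                        help="Wait for a debugger to attach before running.")
    parser.add_argument("--debugpy-port", type=int, default=10099,
                        help="Port for the debugpy server.")


def maybe_start_debugger(args, host=None):
    """Start a debugpy server on rank 0 and block until a client attaches.

    Returns True if the debugger was started, False otherwise.
    """
    # Assumption: only SLURM rank 0 hosts the debug server
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    if not args.debugpy or rank != 0:
        return False

    # Imported lazily so that debugpy stays an optional dependency
    import debugpy

    host = host or socket.gethostname()
    print(f"Waiting for debugger on {host}:{args.debugpy_port}", flush=True)
    debugpy.listen((host, args.debugpy_port))
    debugpy.wait_for_client()
    return True


def cli_main_sketch(argv=None):
    """Stand-in for a CLI entry point; NOT the real anemoi-utils cli_main."""
    parser = argparse.ArgumentParser(prog="anemoi-x")
    add_debug_arguments(parser)
    parser.add_argument("subcommand")
    args = parser.parse_args(argv)
    maybe_start_debugger(args)
    # ... dispatch to the selected subcommand would happen here ...
    return args
```

With something along these lines, `srun -u anemoi-x --debugpy train ...` would pause rank 0 until VSCode attaches, while all other invocations behave as before. Where exactly the hook should live (base class vs. entry point) is left open.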

This is an example script (it's just to convey the idea, it won't work in your setup unless you've already set up your launch.json file in a specific way - also, the actual implementation in anemoi-utils hopefully won't be this hacky 😉 ) that you can use to magically set up debugging for any anemoi-x command on a SLURM compute node:

"""Wrapper script to attach debugpy to anemoi programs running on SLURM clusters.

Description:
    This script sets up a debugpy server on the master node of a SLURM job
    and configures VSCode by modifying the .vscode/launch.json file to connect
    to the debugpy server. It then imports and runs the specified anemoi module's
    main function with the provided command-line arguments.

Example usage:
    srun -u python scripts/debugging/attach_slurm.py anemoi-training train ...

    
Inspired by: https://github.com/KaijingOfficial/Vscode-Debug-Python-on-Slurm
"""

import argparse
import importlib
import json
import os
import subprocess
import sys
import time
import traceback

import debugpy


def setup_debug(is_main_process, port=10099):
    # Only rank 0 hosts the debugpy server; other ranks have nothing to set up.
    # This also avoids several ranks rewriting launch.json at the same time.
    if not is_main_process:
        return

    # SLURM_NODELIST can be compressed (e.g. "node[01-04]"), so expand it with
    # scontrol instead of splitting on commas. Rank 0 is assumed to run on the
    # first node of the allocation.
    master_addr = subprocess.run(
        ["scontrol", "show", "hostnames", os.environ["SLURM_NODELIST"]],
        capture_output=True,
        text=True,
        check=True,
    ).stdout.split()[0]

    # Auto-configure the existing .vscode/launch.json (it must already contain
    # a "Debug: attach to SLURM job" configuration)
    launch_json_path = os.path.join(".vscode", "launch.json")
    with open(launch_json_path, "r") as f:
        existing_config = json.load(f)

    if existing_config.get("configurations"):
        for config in existing_config["configurations"]:
            if config.get("name") == "Debug: attach to SLURM job":
                config.setdefault("connect", {}).update({
                    "port": port,
                    "host": master_addr,
                })
    else:
        raise ValueError("Expected an existing launch.json configuration")

    with open(launch_json_path, "w") as f:
        json.dump(existing_config, f, indent=4)

    # Start the debug server and block until VSCode attaches
    print(f"Debug portal active on {master_addr}:{port}", flush=True)
    debugpy.listen((master_addr, port))
    debugpy.wait_for_client()
    print("Debugger linked!", flush=True)


if __name__ == "__main__":
    # parse the command line arguments (as they would normally be used)
    parser = argparse.ArgumentParser(description="Wrapper to attach debugpy to anemoi programs.")
    parser.add_argument("module", type=str, help="The anemoi module to run (e.g., anemoi-training, anemoi-inference, etc.)")
    parser.add_argument("subcommand", type=str, help="The subcommand to run (e.g., train, run, create, etc.)")
    parser.add_argument("extra_args", nargs=argparse.REMAINDER, help="Additional arguments for the subcommand")
    args = parser.parse_args()

    # import the correct entry point module
    module_name = args.module.replace("-", ".")
    entry_module = importlib.import_module(f"{module_name}.__main__")
    main_func = entry_module.main

    # determine if this is the main process
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    is_main_process = (rank == 0)
    
    # set up debugpy
    print(f"Process rank {rank} - Debug mode: {'ON' if is_main_process else 'OFF'}", flush=True)
    setup_debug(is_main_process)

    # execute the main function with the provided arguments
    cmd_args = [args.subcommand] + args.extra_args
    sys.argv = [args.module] + cmd_args

    if is_main_process:
        # rank-0: let exceptions propagate
        main_func()
    else:
        try:
            main_func()
        except Exception as e:
            print(f"Rank {rank} encountered exception: {e}", flush=True)
            traceback.print_exc()
            print(f"Pausing process on rank {rank} to prevent job termination.", flush=True)
            while True:
                time.sleep(1)
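
Since the script above only updates an entry that already exists in `.vscode/launch.json`, here is a minimal sketch of how that entry could be bootstrapped. The helper name `ensure_attach_config` is hypothetical, the `"type": "debugpy"` value assumes a recent version of the VSCode Python Debugger extension, and the `"localhost"` host is just a placeholder that the wrapper overwrites with the compute node name. Note that `json.load` will fail if your launch.json contains comments, which VSCode itself tolerates.

```python
import json
import os

# Name of the configuration that the wrapper script looks for
CONFIG_NAME = "Debug: attach to SLURM job"


def ensure_attach_config(path=os.path.join(".vscode", "launch.json"), port=10099):
    """Create launch.json, or add the attach entry to it, if missing."""
    if os.path.exists(path):
        with open(path) as f:
            launch = json.load(f)
    else:
        launch = {"version": "0.2.0", "configurations": []}

    configs = launch.setdefault("configurations", [])
    if not any(c.get("name") == CONFIG_NAME for c in configs):
        configs.append({
            "name": CONFIG_NAME,
            "type": "debugpy",  # assumption: recent Python Debugger extension
            "request": "attach",
            # placeholder values; the wrapper script overwrites host and port
            "connect": {"host": "localhost", "port": port},
        })

    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(launch, f, indent=4)
    return launch
```

Running `ensure_attach_config()` once from the workspace root should be enough; calling it again leaves an existing entry untouched.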

Organisation

MeteoSwiss

Labels

enhancement: New feature or request
