Description
Is your feature request related to a problem? Please describe.
A common complaint is that, while it is relatively easy to use the VSCode debugger from a login node on an HPC cluster, doing so from a compute node is considerably trickier. One option is to start an interactive job on a compute node and open a VSCode server on it via Remote SSH. However, this is often not possible or not allowed, so other workarounds are needed; see for example https://github.com/KaijingOfficial/Vscode-Debug-Python-on-Slurm (credit to @dnerini for finding and testing it).
It would be great if we could integrate this into anemoi-utils. Thoughts?
Describe the solution you'd like
Such debugging functionality could be made available across the entire anemoi framework by leveraging the centralized CLI abstractions provided by anemoi-utils. We could, for example, inject the debugpy setup somewhere in the Command base class or in cli_main and activate it behind a --debugpy flag or similar. Below I provide a small example of how the debugpy initialization can be injected before running any program.
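Before the full wrapper script, here is a minimal sketch of the flag-based idea on its own. It is an illustration only and not anemoi-utils code: the helper names `add_debugpy_arguments` and `maybe_start_debugpy`, the `--debugpy-port` option, and binding to `socket.gethostname()` are all assumptions made for this example. Only `debugpy.listen` and `debugpy.wait_for_client` are real debugpy calls.

```python
import argparse
import os
import socket


def add_debugpy_arguments(parser):
    """Register the (hypothetical) --debugpy options on an existing parser."""
    parser.add_argument(
        "--debugpy",
        action="store_true",
        help="Wait for a debugger to attach before running the command.",
    )
    parser.add_argument(
        "--debugpy-port",
        type=int,
        default=10099,  # same default port as the wrapper script in this issue
        help="Port on which the debugpy server listens.",
    )
    return parser


def maybe_start_debugpy(args, environ=os.environ):
    """Start a debugpy server on SLURM rank 0 if --debugpy was given.

    Returns True if a server was started, False otherwise.
    """
    if not args.debugpy:
        return False
    # Only the first task of the job hosts the debugger; other ranks run normally.
    if int(environ.get("SLURM_PROCID", "0")) != 0:
        return False
    # Imported lazily so that debugpy can remain an optional dependency.
    import debugpy

    # Assumption: the node's hostname is reachable from where VSCode runs.
    debugpy.listen((socket.gethostname(), args.debugpy_port))
    debugpy.wait_for_client()  # blocks until VSCode attaches
    return True
```

In a CLI entry point, `add_debugpy_arguments(parser)` would be called before parsing and `maybe_start_debugpy(args)` right after, so that the debugger is attached before any command logic runs. Where exactly this hooks into anemoi-utils is the open question of this issue.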
The example script below magically sets up debugging for any anemoi-x command on a SLURM compute node. It is only meant to convey the idea: it won't work unless your .vscode/launch.json already contains an attach configuration named "Debug: attach to SLURM job" with a "connect" entry, and the actual implementation in anemoi-utils hopefully won't be this hacky 😉

"""Wrapper script to attach debugpy to anemoi programs running on SLURM clusters.

Description:
    This script sets up a debugpy server on the master node of a SLURM job
    and configures VSCode by modifying the .vscode/launch.json file to connect
    to the debugpy server. It then imports and runs the specified anemoi module's
    main function with the provided command-line arguments.

Example usage:
    srun -u python scripts/debugging/attach_slurm.py anemoi-training train ...

Inspired from: https://github.com/KaijingOfficial/Vscode-Debug-Python-on-Slurm
"""

import argparse
import importlib
import json
import os
import sys
import time
import traceback

import debugpy


def setup_debug(is_main_process, port=10099):
    if not is_main_process:
        # Only rank 0 edits launch.json and hosts the debug server, so that
        # several ranks never read and write the same file at the same time.
        return

    # NOTE: assumes SLURM_NODELIST is a plain comma-separated list. Compressed
    # forms such as node[01-04] need expanding first, e.g. with
    # `scontrol show hostnames`.
    master_addr = os.environ["SLURM_NODELIST"].split(",")[0]

    # Auto-configure .vscode/launch.json
    launch_json_path = os.path.join(".vscode", "launch.json")
    os.makedirs(os.path.dirname(launch_json_path), exist_ok=True)

    with open(launch_json_path, "r") as f:
        existing_config = json.load(f)

    if existing_config.get("configurations"):
        for i, config in enumerate(existing_config["configurations"]):
            if config.get("name") == "Debug: attach to SLURM job":
                existing_config["configurations"][i]["connect"].update({
                    "port": port,
                    "host": master_addr
                })
    else:
        raise ValueError("Expected an existing launch.json configuration")

    with open(launch_json_path, "w") as f:
        json.dump(existing_config, f, indent=4)

    # Master process handler
    print(f"Debug portal active on {master_addr}:{port}", flush=True)
    debugpy.listen((master_addr, port))
    debugpy.wait_for_client()
    print("Debugger linked!", flush=True)


if __name__ == "__main__":
    # parse the command line arguments (as they would normally be used)
    parser = argparse.ArgumentParser(description="Wrapper to attach debugpy to anemoi programs.")
    parser.add_argument("module", type=str, help="The anemoi module to run (e.g., anemoi-training, anemoi-inference, etc.)")
    parser.add_argument("subcommand", type=str, help="The subcommand to run (e.g., train, run, create, etc.)")
    parser.add_argument("extra_args", nargs=argparse.REMAINDER, help="Additional arguments for the subcommand")
    args = parser.parse_args()

    # import the correct entry point module
    module_name = args.module.replace("-", ".")
    entry_module = importlib.import_module(f"{module_name}.__main__")
    main_func = entry_module.main

    # determine if this is the main process
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    is_main_process = (rank == 0)

    # set up debugpy
    print(f"Process rank {rank} - Debug mode: {'ON' if is_main_process else 'OFF'}", flush=True)
    setup_debug(is_main_process)

    # execute the main function with the provided arguments
    cmd_args = [args.subcommand] + args.extra_args
    sys.argv = [args.module] + cmd_args

    if is_main_process:
        # rank-0: let exceptions propagate
        main_func()
    else:
        try:
            main_func()
        except Exception as e:
            print(f"Rank {rank} encountered exception: {e}", flush=True)
            traceback.print_exc()
            print(f"Pausing process on rank {rank} to prevent job termination.", flush=True)
            while True:
                time.sleep(1)

Organisation
MeteoSwiss
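
As a companion to the wrapper script above, which raises an error when launch.json has no configurations, the sketch below shows one way the attach entry could be created or updated instead. It is a hedged illustration, not part of the proposal: the helper name `upsert_attach_config` is hypothetical, and the entry uses the standard VSCode attach fields (`"type": "debugpy"`, `"request": "attach"`, `"connect"`) with the configuration name that the script looks for.

```python
import copy

# Name of the launch configuration that the wrapper script searches for.
ATTACH_NAME = "Debug: attach to SLURM job"


def upsert_attach_config(launch, host, port, name=ATTACH_NAME):
    """Return a copy of a parsed launch.json whose attach entry points at host:port.

    An existing entry with the same name keeps its other settings; if no such
    entry exists, a new one is appended.
    """
    updated = copy.deepcopy(launch)
    updated.setdefault("version", "0.2.0")
    configs = updated.setdefault("configurations", [])
    for config in configs:
        if config.get("name") == name:
            config.setdefault("connect", {}).update({"host": host, "port": port})
            break
    else:
        configs.append({
            "name": name,
            # Older versions of the VSCode Python extension used "python" here.
            "type": "debugpy",
            "request": "attach",
            "connect": {"host": host, "port": port},
        })
    return updated
```

Calling `upsert_attach_config({}, host, port)` with the compute node's hostname and the debugpy port yields a dictionary that can be written to .vscode/launch.json with `json.dump`. The input is left unmodified, so the caller can decide whether and when to write the file.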