
slurm #44

Open
shuli12318 opened this issue Mar 20, 2024 · 4 comments

Comments

@shuli12318

I want to know how to run the .sh script for segmentation without Slurm, because Slurm is hard to use.

@shuli12318
Author

How to run semantic segmentation on a single machine (single or multiple GPUs) without Slurm:

1. Place the dataset under the semantic_segmentation folder.

2. To change the batch size, edit semantic_segmentation/configs/base/datasets/ade20k_sfpn.py and set samples_per_gpu=8, workers_per_gpu=8.
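For reference, a sketch of the relevant fragment of that config file (the `data` dict follows the mmcv/mmseg config convention; the surrounding dataset keys are assumed and omitted here):

```python
# Fragment of ade20k_sfpn.py (sketch; other keys omitted).
data = dict(
    samples_per_gpu=8,   # per-GPU batch size
    workers_per_gpu=8,   # dataloader workers per GPU
    # train=..., val=..., test=... dataset definitions go here
)
```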

3. To change lr and iters, edit semantic_segmentation/configs/ade20k/sfpn.biformer_small.py following the usual linear-scaling rule of thumb: keep gpus × batch_size × iters constant, and scale lr linearly with gpus × batch_size.
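The scaling rule above can be sketched as a small helper. The baseline numbers in the example call are hypothetical placeholders, not values taken from the repo:

```python
def scale_schedule(base_gpus, base_batch, base_lr, base_iters,
                   new_gpus, new_batch):
    """Linear-scaling heuristic: keep gpus*batch*iters constant,
    and scale lr linearly with the effective batch gpus*batch."""
    base_eff = base_gpus * base_batch
    new_eff = new_gpus * new_batch
    new_lr = base_lr * new_eff / base_eff
    new_iters = int(base_iters * base_eff / new_eff)
    return new_lr, new_iters

# Hypothetical baseline: 4 GPUs x batch 4, lr 1e-4, 160k iters.
# Doubling the per-GPU batch to 8 doubles lr and halves iters.
lr, iters = scale_schedule(4, 4, 1e-4, 160000, 4, 8)
```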

4. For multi-GPU training you only need to replace the slurm.sh script. Below is my multi-GPU training script that does not use Slurm.
#!/usr/bin/env bash
# PARTITION=mediasuper  # Slurm partition; unused without Slurm
NOW=$(date '+%m-%d-%H:%M:%S')

CONFIG_DIR=configs/ade20k
MODEL=sfpn.biformer_small
CKPT=pretrained/biformer_small_best.pth
CONFIG=${CONFIG_DIR}/${MODEL}.py
OUTPUT_DIR=../outputs/seg
WORK_DIR=${OUTPUT_DIR}/${MODEL}/${NOW}
JOB_NAME=${MODEL}  # set after MODEL so it is not empty
mkdir -p ${WORK_DIR}

export PYTHONPATH="$(dirname $0)/..":$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0,1,2,3  # expose the GPUs to use

# --nproc_per_node is the number of GPUs; --launcher="pytorch"
# avoids the Slurm launcher. Note: comments must not follow a
# trailing backslash, or the line continuation breaks.
torchrun \
    --nproc_per_node=4 \
    --master_port=29501 \
    train.py --config=${CONFIG} \
    --launcher="pytorch" \
    --work-dir=${WORK_DIR} \
    --options model.pretrained=${CKPT} \
    &> ${WORK_DIR}/train.${JOB_NAME}.log &

@taoxingwang

Hi, is single-machine single-GPU set up the same way?

@shuli12318
Author

> Hi, is single-machine single-GPU set up the same way?

Yes. Change export CUDA_VISIBLE_DEVICES=0,1,2,3 to export CUDA_VISIBLE_DEVICES=0, and change --nproc_per_node=4 to --nproc_per_node=1. That's all.
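Concretely, a minimal single-GPU launch would look like this (CONFIG, WORK_DIR, and CKPT as defined in the multi-GPU script above; the other torchrun flags are assumed unchanged):

```shell
# Single-GPU variant: expose one GPU and launch one process.
export CUDA_VISIBLE_DEVICES=0

torchrun \
    --nproc_per_node=1 \
    --master_port=29501 \
    train.py --config=${CONFIG} \
    --launcher="pytorch" \
    --work-dir=${WORK_DIR} \
    --options model.pretrained=${CKPT}
```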

You also need the ckpt weight file, which is linked in the author's README: create a pretrained folder and download it there.

@20191844308

After making these changes I get the following errors:

  • torchrun
    usage: torchrun [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE] [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT] [--rdzv-id RDZV_ID] [--rdzv-conf RDZV_CONF] [--standalone] [--max-restarts MAX_RESTARTS]
    [--monitor-interval MONITOR_INTERVAL] [--start-method {spawn,fork,forkserver}] [--role ROLE] [-m] [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS] [-t TEE] [--node-rank NODE_RANK]
    [--master-addr MASTER_ADDR] [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
    training_script ...
    torchrun: error: the following arguments are required: training_script, training_script_args
  • --nproc_per_node=4 ' '
    train.sh: line 28: --nproc_per_node=4: command not found
  • --master_port=29501
    train.sh: line 29: --master_port=29501: command not found
What is going on here?
