Cannot run deep learning tasks from multiple pods on the same GPU #44

@Lorenz5622

Description

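I created two pods, each limited to 45% of the GPU cores and 45% of the GPU memory. Running deep learning training jobs in both pods at the same time fails with an NCCL error, which investigation traced to insufficient resource allocation. When only one pod runs the same job, nvidia-smi shows GPU utilization reaching 100%, so the GPU core limit is not being enforced.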
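The single-pod observation can be reproduced with a script along the following lines. This is a minimal sketch and not part of the original report: the function names are hypothetical, the 45% thresholds simply mirror the limits described above, and it assumes `nvidia-smi` is on the PATH inside the container. Note that `nvidia-smi` reports whole-device utilization, so the comparison is only meaningful while a single pod is using the GPU.

```python
import subprocess

# Fields requested from nvidia-smi; units are dropped via "nounits".
QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_pct": 100.0 * float(mem_used) / float(mem_total),
        })
    return stats


def over_limit(stats, core_limit_pct=45.0, mem_limit_pct=45.0):
    """Return the GPUs whose utilization or memory share exceeds the limits."""
    return [s for s in stats
            if s["util_pct"] > core_limit_pct or s["mem_pct"] > mem_limit_pct]


def query_gpu_stats():
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `over_limit(query_gpu_stats())` while the training job runs would list any GPU above the configured share. The parser does not handle fields that the driver reports as `[N/A]`.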

YAML used to create the pods:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod1
spec:
  containers:
  - name: my-gpu-container
    image: nvcr.io/nvidia/pytorch:23.08-py3
    command: ["/bin/bash","-c","sleep 86400"]
    env:
    - name: OUT_DIR
      value: "./"
    - name: NCCL_DEBUG
      value: "INFO"
    resources:
      limits:
        memory: "20Gi" # limit memory to 20 GiB
        cpu: 15 # limit CPU to 15 cores
        nvidia.com/gpu: 4 # request 4 GPUs
        nvidia.com/gpumem-percentage: 45 # Each vGPU gets 45% of device memory (Optional,Integer)
        nvidia.com/gpucores: 45 # Each vGPU uses 45% of the entire GPU (Optional,Integer)
    volumeMounts:
    - mountPath: /datadrive
      name: datadrive-volume
  volumes:
  - name: datadrive-volume
    hostPath: # bind-mount via hostPath
      path: / # mount the host's root directory
  hostIPC: true # use the host IPC namespace
  hostNetwork: true # use the host network namespace
  hostPID: true # optionally use the host PID namespace
---
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod2
spec:
  containers:
  - name: my-gpu-container
    image: nvcr.io/nvidia/pytorch:23.08-py3
    command: ["/bin/bash","-c","sleep 86400"]
    env:
    - name: OUT_DIR
      value: "./"
    - name: NCCL_DEBUG
      value: "INFO"
    resources:
      limits:
        memory: "20Gi" # limit memory to 20 GiB
        cpu: 15 # limit CPU to 15 cores
        nvidia.com/gpu: 4 # request 4 GPUs
        nvidia.com/gpumem-percentage: 45 # Each vGPU gets 45% of device memory (Optional,Integer)
        nvidia.com/gpucores: 45 # Each vGPU uses 45% of the entire GPU (Optional,Integer)
    volumeMounts:
    - mountPath: /datadrive
      name: datadrive-volume
  volumes:
  - name: datadrive-volume
    hostPath: # bind-mount via hostPath
      path: / # mount the host's root directory
  hostIPC: true # use the host IPC namespace
  hostNetwork: true # use the host network namespace
  hostPID: true # optionally use the host PID namespace
