Open
Description
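I created two pods, each limited to 45% GPU core utilization and 45% GPU memory. Running deep learning training jobs on both pods at the same time fails with an NCCL error, which I traced to insufficient resource allocation. When only one pod runs the same job, nvidia-smi shows GPU utilization reaching 100%, so the GPU core utilization limit is not being enforced.
The YAML used to create the pods: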
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod1
spec:
  containers:
    - name: my-gpu-container
      image: nvcr.io/nvidia/pytorch:23.08-py3
      command: ["/bin/bash", "-c", "sleep 86400"]
      env:
        - name: OUT_DIR
          value: "./"
        - name: NCCL_DEBUG
          value: "INFO"
      resources:
        limits:
          memory: "20Gi"  # memory limit: 20 GiB
          cpu: 15  # CPU limit: 15 cores
          nvidia.com/gpu: 4  # request 4 GPUs
          nvidia.com/gpumem-percentage: 45  # each vGPU gets 45% of device memory (optional, integer)
          nvidia.com/gpucores: 45  # each vGPU uses 45% of the entire GPU (optional, integer)
      volumeMounts:
        - mountPath: /datadrive
          name: datadrive-volume
  volumes:
    - name: OUT_DIR
    - name: datadrive-volume
      hostPath:  # bind mount via hostPath
        path: /  # mount the host's root directory
  hostIPC: true  # use the host's IPC namespace
  hostNetwork: true  # use the host's network namespace
  hostPID: true  # enable the host's PID namespace if needed
---
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod2
spec:
  containers:
    - name: my-gpu-container
      image: nvcr.io/nvidia/pytorch:23.08-py3
      command: ["/bin/bash", "-c", "sleep 86400"]
      env:
        - name: OUT_DIR
          value: "./"
        - name: NCCL_DEBUG
          value: "INFO"
      resources:
        limits:
          memory: "20Gi"  # memory limit: 20 GiB
          cpu: 15  # CPU limit: 15 cores
          nvidia.com/gpu: 4  # request 4 GPUs
          nvidia.com/gpumem-percentage: 45  # each vGPU gets 45% of device memory (optional, integer)
          nvidia.com/gpucores: 45  # each vGPU uses 45% of the entire GPU (optional, integer)
      volumeMounts:
        - mountPath: /datadrive
          name: datadrive-volume
  volumes:
    - name: OUT_DIR
    - name: datadrive-volume
      hostPath:  # bind mount via hostPath
        path: /  # mount the host's root directory
  hostIPC: true  # use the host's IPC namespace
  hostNetwork: true  # use the host's network namespace
  hostPID: true  # enable the host's PID namespace if needed
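To make the single-pod observation easier to reproduce, here is a minimal sketch of how the utilization could be sampled and compared against the configured limit. It assumes `nvidia-smi` is on the PATH where the script runs. The helper names (`parse_gpu_stats`, `over_limit`) and the `CORE_LIMIT_PCT` constant are hypothetical and not part of any tool mentioned above. A single sample can spike above the limit even when throttling works on average, so readings are best collected repeatedly over the training run.

```python
import subprocess

# Assumption: mirrors the nvidia.com/gpucores value in the pod spec above.
CORE_LIMIT_PCT = 45

# Standard nvidia-smi query flags; output is one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV rows into a list of per-GPU dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_pct": 100.0 * float(mem_used) / float(mem_total),
        })
    return stats


def over_limit(stats, limit_pct=CORE_LIMIT_PCT):
    """Return the indices of GPUs whose utilization exceeds limit_pct."""
    return [s["index"] for s in stats if s["util_pct"] > limit_pct]


if __name__ == "__main__":
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    stats = parse_gpu_stats(out)
    for s in stats:
        print(f"GPU {s['index']}: util {s['util_pct']:.0f}%, mem {s['mem_pct']:.1f}%")
    print("GPUs above core limit:", over_limit(stats))
```

Running this in a loop while the training job is active would show whether utilization stays above 45% consistently or only in short bursts.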