Question: vGPU Pod allocation limit does not match expectations #45

@alicfeng

Description

vgpu-device-plugin configuration

apiVersion: v1
data:
  config.json: |
    {
        "nodeconfig": [
            {
                "name": "m5-cloudinfra-online02",
                "devicememoryscaling": 1,
                "devicesplitcount": 24,
                "disablecorelimit","true",
                "migstrategy":"none"
            }
        ]
    }
kind: ConfigMap

Node GPU setup: a single RTX 3090 with 24 GB of VRAM. The plan is to run 24 Pods, each allocated 1500 MiB of GPU memory.

Problem: the node can only start 8 Pods; the rest stay unschedulable. Could the author advise whether I have misconfigured something, or whether a consumer card like the 3090 is simply not supported?
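For reference, each Pod's resource request looks roughly like the sketch below. This assumes the vGPU scheduler's extended resources (`nvidia.com/gpu` for a slice, `nvidia.com/gpumem` for per-Pod device memory in MiB); the Pod name and image here are illustrative, not the actual deployment:

```yaml
# Sketch of the intended per-Pod vGPU request (names are illustrative;
# nvidia.com/gpumem is assumed to be the plugin's memory resource).
apiVersion: v1
kind: Pod
metadata:
  name: ai-service-example
  namespace: inference
spec:
  containers:
    - name: app
      image: ai-service:latest        # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1           # one vGPU slice
          nvidia.com/gpumem: 1500     # 1500 MiB of device memory per Pod
```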

root@k8s-master-gpu-3090:~# kubectl get po -n inference | grep "i-service-"
ai-service-1-8yzevqj5pp-775bb7b687-t6fxv    1/1     Running   0          33h   # this Pod has no GPU allocated, CPU only
ai-service-30-ztfy9fb0ou-69f79c5d7b-php68   1/1     Running   0          55m
ai-service-31-072z2gvhd6-876bcf789-p9swg    1/1     Running   0          55m
ai-service-32-m321gi6s76-68cb4d745d-79jr7   0/1     Pending   0          12m
ai-service-33-ogppkk7j9a-767b775f67-gfdzx   1/1     Running   0          55m
ai-service-34-yi6in6yxd3-6c7d7f5cfb-w2tvn   1/1     Running   0          55m
ai-service-35-e3e1gxur3c-7cb894c68f-cqxjv   1/1     Running   0          55m
ai-service-36-lp4oinnc4o-797f6d454c-48sgn   0/1     Pending   0          12m
ai-service-37-hf1j6dshed-6f6bfbcf5b-bpdn4   1/1     Running   0          55m
ai-service-38-ehncl0weuv-fbc89f76c-2nhm9    0/1     Pending   0          102s
ai-service-38-ehncl0weuv-fbc89f76c-56zkw    0/1     Pending   0          117s
ai-service-38-ehncl0weuv-fbc89f76c-5dnql    1/1     Running   0          55m
ai-service-38-ehncl0weuv-fbc89f76c-5mk2d    0/1     Pending   0          117s
ai-service-38-ehncl0weuv-fbc89f76c-dfz9g    0/1     Pending   0          102s
ai-service-38-ehncl0weuv-fbc89f76c-kwkbf    0/1     Pending   0          117s
ai-service-38-ehncl0weuv-fbc89f76c-n4twc    0/1     Pending   0          2m12s
ai-service-38-ehncl0weuv-fbc89f76c-s9fs8    0/1     Pending   0          2m12s
ai-service-38-ehncl0weuv-fbc89f76c-wns2q    0/1     Pending   0          117s
ai-service-38-ehncl0weuv-fbc89f76c-x2pnh    0/1     Pending   0          2m12s
ai-service-39-05iiqpvtwc-ddd74d659-qlw45    1/1     Running   0          12m


kubectl describe po -n inference ai-service-38-ehncl0weuv-fbc89f76c-5mk2d

Events:
  Type     Reason            Age   From           Message
  ----     ------            ----  ----           -------
  Warning  FailedScheduling  4m3s  4pd-scheduler  0/3 nodes are available: 2 node unregisterd. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod.


Capacity:
  cpu:                20
  ephemeral-storage:  101430960Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             41088492Ki
  nvidia.com/gpu:     24
  pods:               110
Allocatable:
  cpu:                20
  ephemeral-storage:  93478772582
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             40986092Ki
  nvidia.com/gpu:     24
  pods:               110



Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                9700m (48%)  10500m (52%)
  memory             350Mi (0%)   1000Mi (2%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     9            9
Events:              <none>

GPU information

Sat Apr 19 12:01:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:0A:00.0 Off |                  N/A |
| 30%   31C    P8             19W /  350W |    8134MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    983567      C   ...local/ai/.venv/bin/python       1012MiB |
|    0   N/A  N/A    984134      C   ...local/ai/.venv/bin/python       1018MiB |
|    0   N/A  N/A    984728      C   ...local/ai/.venv/bin/python       1008MiB |
|    0   N/A  N/A    985224      C   ...local/ai/.venv/bin/python       1010MiB |
|    0   N/A  N/A    985615      C   ...local/ai/.venv/bin/python       1010MiB |
|    0   N/A  N/A    986369      C   ...local/ai/.venv/bin/python       1008MiB |
|    0   N/A  N/A    986605      C   ...local/ai/.venv/bin/python       1012MiB |
|    0   N/A  N/A   1002357      C   ...local/ai/.venv/bin/python       1010MiB |
+-----------------------------------------------------------------------------------------+

My understanding is that 24 Pods should be able to start, but in practice only 8 can run.
