Question: vGPU Pod allocation limit does not match expectations #45

@alicfeng

Description

vgpu-device-plugin configuration

apiVersion: v1
data:
  config.json: |
    {
        "nodeconfig": [
            {
                "name": "m5-cloudinfra-online02",
                "devicememoryscaling": 1,
                "devicesplitcount": 24,
                "disablecorelimit","true",
                "migstrategy":"none"
            }
        ]
    }
kind: ConfigMap

Node GPU setup: a single RTX 3090 with 24 GB of VRAM. The plan is to run 24 Pods, each allocated 1500 MiB of GPU memory.

Problem: the node can only start 8 Pods; the rest stay unschedulable. Could the author advise whether I have misconfigured something, or whether a consumer card like the 3090 is simply not supported?
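For reference, each Pod's resource request looks roughly like the sketch below. This assumes the vGPU scheduler's extended resources (`nvidia.com/gpu` for a slice, `nvidia.com/gpumem` for per-Pod device memory in MiB); the Pod name and image here are illustrative, not the actual deployment:

```yaml
# Sketch of the intended per-Pod vGPU request (names are illustrative;
# nvidia.com/gpumem is assumed to be the plugin's memory resource).
apiVersion: v1
kind: Pod
metadata:
  name: ai-service-example
  namespace: inference
spec:
  containers:
    - name: app
      image: ai-service:latest        # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1           # one vGPU slice
          nvidia.com/gpumem: 1500     # 1500 MiB of device memory per Pod
```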

root@k8s-master-gpu-3090:~# kubectl get po -n inference | grep "i-service-"
ai-service-1-8yzevqj5pp-775bb7b687-t6fxv    1/1     Running   0          33h   # this Pod has no GPU allocated, CPU only
ai-service-30-ztfy9fb0ou-69f79c5d7b-php68   1/1     Running   0          55m
ai-service-31-072z2gvhd6-876bcf789-p9swg    1/1     Running   0          55m
ai-service-32-m321gi6s76-68cb4d745d-79jr7   0/1     Pending   0          12m
ai-service-33-ogppkk7j9a-767b775f67-gfdzx   1/1     Running   0          55m
ai-service-34-yi6in6yxd3-6c7d7f5cfb-w2tvn   1/1     Running   0          55m
ai-service-35-e3e1gxur3c-7cb894c68f-cqxjv   1/1     Running   0          55m
ai-service-36-lp4oinnc4o-797f6d454c-48sgn   0/1     Pending   0          12m
ai-service-37-hf1j6dshed-6f6bfbcf5b-bpdn4   1/1     Running   0          55m
ai-service-38-ehncl0weuv-fbc89f76c-2nhm9    0/1     Pending   0          102s
ai-service-38-ehncl0weuv-fbc89f76c-56zkw    0/1     Pending   0          117s
ai-service-38-ehncl0weuv-fbc89f76c-5dnql    1/1     Running   0          55m
ai-service-38-ehncl0weuv-fbc89f76c-5mk2d    0/1     Pending   0          117s
ai-service-38-ehncl0weuv-fbc89f76c-dfz9g    0/1     Pending   0          102s
ai-service-38-ehncl0weuv-fbc89f76c-kwkbf    0/1     Pending   0          117s
ai-service-38-ehncl0weuv-fbc89f76c-n4twc    0/1     Pending   0          2m12s
ai-service-38-ehncl0weuv-fbc89f76c-s9fs8    0/1     Pending   0          2m12s
ai-service-38-ehncl0weuv-fbc89f76c-wns2q    0/1     Pending   0          117s
ai-service-38-ehncl0weuv-fbc89f76c-x2pnh    0/1     Pending   0          2m12s
ai-service-39-05iiqpvtwc-ddd74d659-qlw45    1/1     Running   0          12m


kubectl describe po -n inference ai-service-38-ehncl0weuv-fbc89f76c-5mk2d

Events:
  Type     Reason            Age   From           Message
  ----     ------            ----  ----           -------
  Warning  FailedScheduling  4m3s  4pd-scheduler  0/3 nodes are available: 2 node unregisterd. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod.


Capacity:
  cpu:                20
  ephemeral-storage:  101430960Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             41088492Ki
  nvidia.com/gpu:     24
  pods:               110
Allocatable:
  cpu:                20
  ephemeral-storage:  93478772582
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             40986092Ki
  nvidia.com/gpu:     24
  pods:               110



Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                9700m (48%)  10500m (52%)
  memory             350Mi (0%)   1000Mi (2%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     9            9
Events:              <none>

GPU information

Sat Apr 19 12:01:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:0A:00.0 Off |                  N/A |
| 30%   31C    P8             19W /  350W |    8134MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    983567      C   ...local/ai/.venv/bin/python       1012MiB |
|    0   N/A  N/A    984134      C   ...local/ai/.venv/bin/python       1018MiB |
|    0   N/A  N/A    984728      C   ...local/ai/.venv/bin/python       1008MiB |
|    0   N/A  N/A    985224      C   ...local/ai/.venv/bin/python       1010MiB |
|    0   N/A  N/A    985615      C   ...local/ai/.venv/bin/python       1010MiB |
|    0   N/A  N/A    986369      C   ...local/ai/.venv/bin/python       1008MiB |
|    0   N/A  N/A    986605      C   ...local/ai/.venv/bin/python       1012MiB |
|    0   N/A  N/A   1002357      C   ...local/ai/.venv/bin/python       1010MiB |
+-----------------------------------------------------------------------------------------+

My understanding is that 24 Pods should be able to start, but in practice only 8 can run.
