vgpu-device-plugin configuration:
apiVersion: v1
data:
  config.json: |
    {
      "nodeconfig": [
        {
          "name": "m5-cloudinfra-online02",
          "devicememoryscaling": 1,
          "devicesplitcount": 24,
          "disablecorelimit": "true",
          "migstrategy": "none"
        }
      ]
    }
kind: ConfigMap
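For context, each of the pods below requests one vGPU slice. A minimal sketch of such a pod spec, assuming HAMi's `nvidia.com/gpumem` resource name for the per-pod memory cap (the pod name and image here are placeholders, not from the cluster above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-service-example        # hypothetical name
spec:
  containers:
    - name: ai-service
      image: registry.example/ai-service:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # one slice out of devicesplitcount (24)
          nvidia.com/gpumem: 1500  # 1500 MiB of device memory per pod
```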
Node GPU details: a single RTX 3090 with 24 GB of VRAM. The plan is to run 24 Pods, each allocated 1500 MiB of device memory.
Problem: the node can only start 8 Pods, and the rest cannot be scheduled. Could the author advise whether I have configured something incorrectly, or whether the consumer-grade 3090 simply isn't supported?
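Before digging into the scheduler, the raw memory arithmetic is worth checking (the 24576 MiB total comes from the nvidia-smi output further down):

```python
# Sanity-check the vGPU memory plan: 24 pods x 1500 MiB on a 24 GiB 3090.
total_mib = 24576       # physical device memory reported by nvidia-smi
per_pod_mib = 1500      # planned allocation per pod
planned_pods = 24       # matches devicesplitcount

demand_mib = planned_pods * per_pod_mib     # total memory the plan requires
max_pods_no_oversub = total_mib // per_pod_mib
scaling_needed = demand_mib / total_mib     # devicememoryscaling to fit the plan

print(demand_mib)                # 36000 MiB demanded
print(max_pods_no_oversub)       # 16 pods fit without oversubscription
print(round(scaling_needed, 2))  # ~1.46 scaling needed for all 24
```

So with devicememoryscaling left at 1, at most 16 such pods can fit memory-wise; fitting all 24 would require oversubscription (scaling of roughly 1.47). That alone still does not explain stopping at 8 pods.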
root@k8s-master-gpu-3090:~# kubectl get po -n inference | grep "i-service-"
ai-service-1-8yzevqj5pp-775bb7b687-t6fxv 1/1 Running 0 33h   (this one has no GPU allocated, CPU only)
ai-service-30-ztfy9fb0ou-69f79c5d7b-php68 1/1 Running 0 55m
ai-service-31-072z2gvhd6-876bcf789-p9swg 1/1 Running 0 55m
ai-service-32-m321gi6s76-68cb4d745d-79jr7 0/1 Pending 0 12m
ai-service-33-ogppkk7j9a-767b775f67-gfdzx 1/1 Running 0 55m
ai-service-34-yi6in6yxd3-6c7d7f5cfb-w2tvn 1/1 Running 0 55m
ai-service-35-e3e1gxur3c-7cb894c68f-cqxjv 1/1 Running 0 55m
ai-service-36-lp4oinnc4o-797f6d454c-48sgn 0/1 Pending 0 12m
ai-service-37-hf1j6dshed-6f6bfbcf5b-bpdn4 1/1 Running 0 55m
ai-service-38-ehncl0weuv-fbc89f76c-2nhm9 0/1 Pending 0 102s
ai-service-38-ehncl0weuv-fbc89f76c-56zkw 0/1 Pending 0 117s
ai-service-38-ehncl0weuv-fbc89f76c-5dnql 1/1 Running 0 55m
ai-service-38-ehncl0weuv-fbc89f76c-5mk2d 0/1 Pending 0 117s
ai-service-38-ehncl0weuv-fbc89f76c-dfz9g 0/1 Pending 0 102s
ai-service-38-ehncl0weuv-fbc89f76c-kwkbf 0/1 Pending 0 117s
ai-service-38-ehncl0weuv-fbc89f76c-n4twc 0/1 Pending 0 2m12s
ai-service-38-ehncl0weuv-fbc89f76c-s9fs8 0/1 Pending 0 2m12s
ai-service-38-ehncl0weuv-fbc89f76c-wns2q 0/1 Pending 0 117s
ai-service-38-ehncl0weuv-fbc89f76c-x2pnh 0/1 Pending 0 2m12s
ai-service-39-05iiqpvtwc-ddd74d659-qlw45 1/1 Running 0 12m
kubectl describe po -n inference ai-service-38-ehncl0weuv-fbc89f76c-5mk2d
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4m3s 4pd-scheduler 0/3 nodes are available: 2 node unregisterd. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod.
Capacity:
cpu: 20
ephemeral-storage: 101430960Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 41088492Ki
nvidia.com/gpu: 24
pods: 110
Allocatable:
cpu: 20
ephemeral-storage: 93478772582
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 40986092Ki
nvidia.com/gpu: 24
pods: 110
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 9700m (48%) 10500m (52%)
memory 350Mi (0%) 1000Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 9 9
Events: <none>
GPU information:
Sat Apr 19 12:01:45 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:0A:00.0 Off | N/A |
| 30% 31C P8 19W / 350W | 8134MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 983567 C ...local/ai/.venv/bin/python 1012MiB |
| 0 N/A N/A 984134 C ...local/ai/.venv/bin/python 1018MiB |
| 0 N/A N/A 984728 C ...local/ai/.venv/bin/python 1008MiB |
| 0 N/A N/A 985224 C ...local/ai/.venv/bin/python 1010MiB |
| 0 N/A N/A 985615 C ...local/ai/.venv/bin/python 1010MiB |
| 0 N/A N/A 986369 C ...local/ai/.venv/bin/python 1008MiB |
| 0 N/A N/A 986605 C ...local/ai/.venv/bin/python 1012MiB |
| 0 N/A N/A 1002357 C ...local/ai/.venv/bin/python 1010MiB |
+-----------------------------------------------------------------------------------------+
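Summing the per-process usage from the nvidia-smi table above is a quick cross-check that exactly 8 GPU workloads are resident on the card:

```python
# Per-process GPU memory usage copied from the nvidia-smi table, in MiB.
proc_mib = [1012, 1018, 1008, 1010, 1010, 1008, 1012, 1010]

total = sum(proc_mib)
print(len(proc_mib), total)  # 8 processes, 8088 MiB
# nvidia-smi reports 8134 MiB in use overall; the small gap is
# presumably per-context/driver overhead not attributed to a process.
```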
My understanding is that 24 Pods should be able to run, but in practice only 8 do.