
feat: Add GPU usage to pod view #3437


Closed
wants to merge 1 commit into from


NirLevy98 commented Jul 6, 2025

Description

Adds a GPU column to the pods view that displays the total number of GPUs requested by each pod.

Motivation

When managing GPU workloads in Kubernetes clusters, it's useful to quickly see which pods are consuming GPU resources directly in the k9s interface without having to describe each pod individually.

[screenshot]

NirLevy98 force-pushed the add-gpus-to-pod-panel branch 3 times, most recently from 816c10b to 12797d1 on July 7, 2025 07:24
NirLevy98
Author

@derailed
Can you review this, please?
Thanks!

Owner

derailed left a comment


@NirLevy98 Nice! Figured that would be coming next. Thank you for this update!

@@ -67,6 +68,7 @@ var defaultPodHeader = model1.Header{
model1.HeaderColumn{Name: "%CPU/L", Attrs: model1.Attrs{Align: tview.AlignRight, MX: true}},
model1.HeaderColumn{Name: "%MEM/R", Attrs: model1.Attrs{Align: tview.AlignRight, MX: true}},
model1.HeaderColumn{Name: "%MEM/L", Attrs: model1.Attrs{Align: tview.AlignRight, MX: true}},
model1.HeaderColumn{Name: "GPUS", Attrs: model1.Attrs{Align: tview.AlignRight}},
Owner

derailed Jul 11, 2025


I think it would be good to stay consistent with CPU, i.e. 'GPU R/L', '%GPU/R', and '%GPU/L' as separate columns, so one can sort on them or see where they are at.
We should also offer GPU reporting in the container view.
What do you think?

Author


@derailed Done! I’ve updated it to match the CPU/MEM pattern with the GPU, GPU/RL, %GPU/R, and %GPU/L columns.
As for the container view, I think it’s nice to have, but not a must.
Anyway, I’d like to merge this feature if that’s okay, and start using it :)

NirLevy98 force-pushed the add-gpus-to-pod-panel branch 2 times, most recently from 3839765 to 820e6d1 on July 13, 2025 12:52
Owner

derailed left a comment


@NirLevy98 Thank you for the updates. I have actually started working on this...
I think we should update container view as well so we remain consistent across all resource usage.

@@ -191,6 +199,10 @@ func (p *Pod) defaultRow(pwm *PodWithMetrics, row *model1.Row) error {
client.ToPercentageStr(c.cpu, r.lcpu),
client.ToPercentageStr(c.mem, r.mem),
client.ToPercentageStr(c.mem, r.lmem),
strconv.FormatInt(g.current, 10),
Owner


I think you can use toMi here.

Author


Didn't work for me

NirLevy98 force-pushed the add-gpus-to-pod-panel branch 2 times, most recently from f75040f to ac2546a on July 13, 2025 17:14
NirLevy98 requested a review from derailed July 13, 2025 17:54
NirLevy98
Author

@derailed WDYT? :)

Owner

derailed left a comment


@NirLevy98 Thank you for the update. Though I do appreciate your enthusiasm to get this out, I hope you can appreciate that I end up having to support all code that makes it through review. It's close but needs some TLC.

cc := make([]v1.Container, 0, len(spec.InitContainers)+len(spec.Containers))
cc = append(cc, filterSidecarCO(spec.InitContainers)...)
cc = append(cc, spec.Containers...)

// Get CPU and Memory requests/limits
rcpu, rmem := cosRequests(cc)
Owner


Given the above, this should also return rgpu, i.e. there is no need for cosGPU below, since we can collect all requests/limits for CPU, GPU, and memory in one swoop.
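The one-swoop collection described above might look like the following minimal sketch. It assumes simplified stand-ins (`ResourceList`, `Container`, and a hypothetical `knownGPUResources` list) instead of the real `v1.Container` and `resource.Quantity` types; the name `collectRequests` is illustrative only and is not the k9s `cosRequests` implementation:

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for v1.ResourceList:
// CPU in millicores, memory in bytes, GPUs as a device count.
type ResourceList map[string]int64

// Container is a simplified stand-in for v1.Container.
type Container struct {
	Name     string
	Requests ResourceList
}

// knownGPUResources lists extended resource names assumed to denote GPUs.
var knownGPUResources = []string{"nvidia.com/gpu", "amd.com/gpu"}

// collectRequests walks the containers once and accumulates CPU, memory and
// GPU requests together, so no separate GPU pass is needed.
func collectRequests(cc []Container) (cpu, mem, gpu int64) {
	for _, c := range cc {
		// Reading a missing key (or a nil map) yields 0.
		cpu += c.Requests["cpu"]
		mem += c.Requests["memory"]
		for _, name := range knownGPUResources {
			gpu += c.Requests[name]
		}
	}
	return
}

func main() {
	cc := []Container{
		{Name: "trainer", Requests: ResourceList{"cpu": 500, "memory": 1 << 30, "nvidia.com/gpu": 2}},
		{Name: "sidecar", Requests: ResourceList{"cpu": 100, "memory": 64 << 20}},
	}
	cpu, mem, gpu := collectRequests(cc)
	fmt.Println(cpu, mem, gpu) // 600 1140850688 2
}
```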

Author


Done


// Check requests
if container.Resources.Requests != nil {
for _, gpuResource := range config.KnownGPUVendors {
Owner


This function will go away, but we can DRY this up into a helper that we can reuse when collecting requests/limits.
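A minimal sketch of such a helper, assuming simplified stand-in types and a hypothetical `gpuVendors` list in place of `config.KnownGPUVendors`: `gpuCount` sums the known GPU resources in one `ResourceList`, and the same call serves both requests and limits.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for v1.ResourceList (values are counts).
type ResourceList map[string]int64

// Container is a simplified stand-in for v1.Container's resource section.
type Container struct {
	Requests, Limits ResourceList
}

// gpuVendors is an assumed list of GPU extended-resource names.
var gpuVendors = []string{"nvidia.com/gpu", "amd.com/gpu", "gpu.intel.com/i915"}

// gpuCount is the shared helper: it sums every known GPU resource in a
// single ResourceList, so the same code serves requests and limits.
func gpuCount(rl ResourceList) int64 {
	var n int64
	for _, name := range gpuVendors {
		n += rl[name]
	}
	return n
}

// podGPUs reuses gpuCount for both the request side and the limit side.
func podGPUs(cc []Container) (req, lim int64) {
	for _, c := range cc {
		req += gpuCount(c.Requests)
		lim += gpuCount(c.Limits)
	}
	return
}

func main() {
	cc := []Container{
		{Requests: ResourceList{"nvidia.com/gpu": 1}, Limits: ResourceList{"nvidia.com/gpu": 1}},
		{Requests: ResourceList{"amd.com/gpu": 2}, Limits: ResourceList{"amd.com/gpu": 2}},
	}
	req, lim := podGPUs(cc)
	fmt.Printf("GPU/R:L %d:%d\n", req, lim) // GPU/R:L 3:3
}
```

Kubernetes does not overcommit extended resources, so when both a GPU request and limit are set they must be equal; the two totals will therefore usually match.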

Author


I removed this function. I’m not sure we need to move it now. Can you take a look at the new code and let me know what you think?

@@ -661,6 +664,166 @@ func TestCheckPhase(t *testing.T) {
}
}

func TestPodGPUCalculation(t *testing.T) {
Owner


This should all fold under gatherCoMX

Author


Done

NirLevy98 force-pushed the add-gpus-to-pod-panel branch 8 times, most recently from 9ca83df to 4d927c3 on July 15, 2025 10:02
NirLevy98 requested a review from derailed July 15, 2025 10:02
NirLevy98
Author

NirLevy98 commented Jul 15, 2025

@derailed
Fixed, and all the tests passed

[screenshot]

NirLevy98 force-pushed the add-gpus-to-pod-panel branch 2 times, most recently from c1e7b8d to c5c60f3 on July 15, 2025 10:53
NirLevy98 force-pushed the add-gpus-to-pod-panel branch from c5c60f3 to 2a833e4 on July 15, 2025 10:58
Owner

derailed left a comment


@NirLevy98 Thank you for these updates!
Definitely better, but still not quite what I'd envisioned:

  1. Containers need to expose GPU so users can track which container is pulling resources
  2. GPU should be treated as just another resource

I need to push a new release and I'm running out of time on these reviews. I'll have some of this implemented in the next drop and we can iterate from there.
Thank you for your support and guidance in seeing this through.
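Point 1 could be approached by computing GPU requests and limits per container rather than only per pod. This is a minimal sketch with simplified stand-in types; `containerGPURows` and the `gpuResources` list are hypothetical names, and the row format is illustrative, not the actual k9s container view:

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for v1.ResourceList (values are counts).
type ResourceList map[string]int64

// Container is a simplified stand-in for v1.Container.
type Container struct {
	Name             string
	Requests, Limits ResourceList
}

// gpuResources lists the GPU extended-resource names assumed by this sketch.
var gpuResources = []string{"nvidia.com/gpu", "amd.com/gpu"}

// sumGPU totals the known GPU resources in one ResourceList.
func sumGPU(rl ResourceList) (n int64) {
	for _, name := range gpuResources {
		n += rl[name]
	}
	return
}

// containerGPURows renders one "NAME R:L" row per container, so a user can
// see which container is holding the GPUs, not just the pod total.
func containerGPURows(cc []Container) []string {
	rows := make([]string, 0, len(cc))
	for _, c := range cc {
		rows = append(rows, fmt.Sprintf("%s %d:%d", c.Name, sumGPU(c.Requests), sumGPU(c.Limits)))
	}
	return rows
}

func main() {
	cc := []Container{
		{Name: "trainer", Requests: ResourceList{"nvidia.com/gpu": 2}, Limits: ResourceList{"nvidia.com/gpu": 2}},
		{Name: "logger"},
	}
	for _, r := range containerGPURows(cc) {
		fmt.Println(r)
	}
	// trainer 2:2
	// logger 0:0
}
```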

@@ -67,6 +68,7 @@ var defaultPodHeader = model1.Header{
model1.HeaderColumn{Name: "%CPU/L", Attrs: model1.Attrs{Align: tview.AlignRight, MX: true}},
model1.HeaderColumn{Name: "%MEM/R", Attrs: model1.Attrs{Align: tview.AlignRight, MX: true}},
model1.HeaderColumn{Name: "%MEM/L", Attrs: model1.Attrs{Align: tview.AlignRight, MX: true}},
model1.HeaderColumn{Name: "GPU", Attrs: model1.Attrs{Align: tview.AlignRight, MX: true}},
Owner


We should expose request/limit for gpu resources.

mem.Add(*requests.Memory())
}

for _, gpuResource := range config.KnownGPUVendors {
Owner


Again, this should be a helper.

return
}

func cosLimits(cc []v1.Container) (cpuQ, memQ resource.Quantity) {
func cosRequests(cc []v1.Container) (cpuQ, memQ resource.Quantity, gpu int64) {
Owner


Why is GPU different? It should be a Quantity.

derailed mentioned this pull request Jul 15, 2025
derailed
Owner

Closing as of v0.50.8

derailed closed this Jul 16, 2025

2 participants