is it possible to reload nvidia kernel modules if they were unloaded? #12143
Unanswered
tan-wei-xin-alez
asked this question in
Q&A
Replies: 1 comment 3 replies
The Talos API is the only process allowed to load and unload kernel modules, so you'll have to submit a patch to the API to load the kernel modules. I'm not sure whether this requires a reboot.
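As a rough illustration of what such a patch might look like, here is a minimal Python sketch that builds a machine-config patch listing the kernel modules to load. The `machine.kernel.modules` layout and the four module names are taken from the Talos NVIDIA guide as I understand it; the helper name `build_module_patch` is hypothetical, and whether applying this actually reloads modules that were unloaded at runtime (without a reboot) is not confirmed anywhere in this thread.

```python
import json

# Assumption: module names as listed in the Talos NVIDIA guide; adjust them to
# match the driver extensions actually installed on the node.
NVIDIA_MODULES = ["nvidia", "nvidia_uvm", "nvidia_drm", "nvidia_modeset"]


def build_module_patch(modules):
    """Return a machine-config patch (as a dict) listing kernel modules to load.

    Hypothetical helper for illustration only; it just builds the data structure.
    """
    return {"machine": {"kernel": {"modules": [{"name": m} for m in modules]}}}


if __name__ == "__main__":
    # JSON is also valid YAML, so the output can be saved to a patch file.
    print(json.dumps(build_module_patch(NVIDIA_MODULES), indent=2))
```

The output could be saved to a file and submitted with something like `talosctl patch mc --patch @patch.json` against the affected node, but treat that as an assumption to verify against the Talos documentation for your version.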
So, as a follow-up to "how to run nvidia-bug-report or equivalent": the approach there worked because we had already rebooted the node, but we now have another node with the same error.

This time, though, the same approach does not work, as we get the following error

We can get around it by removing `"runtimeClassName": "nvidia"`, but obviously the error is popping up because the kernel modules for nvidia have been unloaded.

Since Talos is an immutable OS, I guess there is no way to reload the nvidia kernel modules if the above happens without rebooting the machine, am I correct? Or is there some really hacky way to do it? We have other pods still running on the node that can reach the other GPUs on the node (although they will experience an error with `nvml` if they try to use them); it's just that this one GPU experienced this error and unloaded the nvidia kernel modules in the process.
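To confirm which modules are actually gone, one option is to inspect the node's `/proc/modules`. The sketch below is only an illustration under a few assumptions: that the file contents have been fetched somehow (for example with `talosctl read /proc/modules`, which I believe exists but have not verified for every Talos version), that the expected module names are the usual four NVIDIA ones, and that the function names are hypothetical helpers, not part of any tool.

```python
# Assumption: the usual NVIDIA module set; adjust for your driver version.
EXPECTED_NVIDIA_MODULES = ("nvidia", "nvidia_uvm", "nvidia_drm", "nvidia_modeset")


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules content.

    Each line of /proc/modules starts with the module name, followed by
    size, reference count, dependencies, state and load address.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def missing_nvidia_modules(proc_modules_text, expected=EXPECTED_NVIDIA_MODULES):
    """Return the expected modules that are absent, in the order given."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in expected if name not in loaded]


if __name__ == "__main__":
    # Made-up sample input for illustration, not output from a real node.
    sample = "nvidia 100 0 - Live 0x0000000000000000\n"
    print(missing_nvidia_modules(sample))
```

An empty result would mean all expected modules are still listed; anything else names the modules that would need to be loaded again.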