-
Notifications
You must be signed in to change notification settings - Fork 61
fix: Implement Taint and Toleration Handling for NicClusterPolicy #1544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Pull Request Test Coverage Report for Build 15584574385Details
💛 - Coveralls |
/retest-nic_operator_helm |
/retest-nic_operator_kind |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements taint-aware node filtering and enhances OFED deployment by handling GPU tolerations and adding new predicates for Node taint changes.
- Introduces a new NodeTaintChangedPredicate for filtering Node events based on taint changes.
- Adds GPU toleration support in state_ofed.go via the addGPUToleration function and updates manifest rendering to use processed tolerations.
- Expands test coverage for node filter behavior and controller responses to taint/toleration changes.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
pkg/state/state_ofed_test.go | Updates tests to validate GPU toleration and node filtering logic |
pkg/state/state_ofed.go | Adds GPU toleration helper, adjusts manifest object generation |
pkg/nodeinfo/filter_test.go | Introduces comprehensive tests for composite node filtering |
pkg/nodeinfo/filter.go | Implements NodeTaintFilter and composite filters |
controllers/predicate.go | Adds NodeTaintChangedPredicate for Node event filtering |
controllers/nicclusterpolicy_controller_test.go | Enhances tests for taint and toleration interactions in NCP |
controllers/nicclusterpolicy_controller.go | Updates controller setup to include new Node predicates |
Comments suppressed due to low confidence (1)
pkg/nodeinfo/filter.go:175
- The unschedulable taint key is hard-coded. Consider defining it as a constant to improve readability and maintainability.
if taint.Key == "node.kubernetes.io/unschedulable" && taint.Effect == corev1.TaintEffectNoSchedule {
|
||
if len(objs) == 0 { | ||
// GetManifestObjects returned no objects, this means that no objects need to be applied to the cluster | ||
// as (most likely) no Mellanox hardware is found (No mellanox labels where found). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a duplicated empty check for objs after waitForStaleObjectsRemoval. Please add a clarifying comment to explain the positioning of this check and confirm that this ordering is intentional.
// as (most likely) no Mellanox hardware is found (No mellanox labels where found). | |
// as (most likely) no Mellanox hardware is found (No mellanox labels where found). | |
// This check is intentionally placed after waitForStaleObjectsRemoval to ensure that we handle cases | |
// where no objects are returned but stale objects might still exist and need to be removed. |
Copilot uses AI. Check for mistakes.
pkg/state/state_ofed.go
Outdated
Key: "nvidia.com/gpu", | ||
Effect: v1.TaintEffectNoSchedule, | ||
Operator: v1.TolerationOpExists, // Operator can be Exists if Value is not specified, or Equal if Value is specified. | ||
// Value: "true", // Often GPU taints have a value, but Exists is fine if the taint is just by key and effect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] When comparing tolerations in addGPUToleration, consider also comparing the Value field if GPU taints might include a value. This will ensure that the toleration check remains robust in all scenarios.
// Value: "true", // Often GPU taints have a value, but Exists is fine if the taint is just by key and effect. | |
Value: "", // Default to an empty value; update if specific taints require a value. |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
823cffc
to
1a9c1a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once @heyvister1 comment will be addressed
@dahalperin you need to sign-off your commits to pass the DCO check |
Introduces taint-aware node filtering and OFED deployment. - Controller now watches Node taint changes. - New `NodeTaintChangedPredicate` and composite node filter (labels & taints) for OFED node selection. - `state_ofed` uses this filter and adds a default `nvidia.com/gpu` toleration to OFED DaemonSets. - NCP tolerations are respected. - Extensive new tests for controller, filtering, and OFED state. Signed-off-by: Dana Halperin <[email protected]>
75bab8f
to
988f748
Compare
/retest-nic_operator_helm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/retest-nic_operator_helm |
/retest-nic_operator_helm |
Introduces taint-aware node filtering and OFED deployment.
NodeTaintChangedPredicate
and composite node filter (labels & taints) for OFED node selection.state_ofed
uses this filter and adds a defaultnvidia.com/gpu
toleration to OFED DaemonSets.