Skip to content

fix: Implement Taint and Toleration Handling for NicClusterPolicy #1544

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dahalperin
Copy link
Contributor

Introduces taint-aware node filtering and OFED deployment.

  • Controller now watches Node taint changes.
  • New NodeTaintChangedPredicate and composite node filter (labels & taints) for OFED node selection.
  • state_ofed uses this filter and adds a default nvidia.com/gpu toleration to OFED DaemonSets.
  • NCP tolerations are respected.
  • Extensive new tests for controller, filtering, and OFED state.

@coveralls
Copy link
Collaborator

coveralls commented May 29, 2025

Pull Request Test Coverage Report for Build 15584574385

Details

  • 96 of 109 (88.07%) changed or added relevant lines in 4 files are covered.
  • 13 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.2%) to 60.285%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/state/state_ofed.go 32 34 94.12%
pkg/nodeinfo/filter.go 38 43 88.37%
controllers/predicate.go 7 13 53.85%
Files with Coverage Reduction New Missed Lines %
pkg/state/state_skel.go 4 83.82%
controllers/nicclusterpolicy_controller.go 9 76.68%
Totals Coverage Status
Change from base Build 15579268967: 0.2%
Covered Lines: 3593
Relevant Lines: 5960

💛 - Coveralls

@rollandf
Copy link
Member

rollandf commented Jun 4, 2025

/retest-nic_operator_helm

@dahalperin
Copy link
Contributor Author

/retest-nic_operator_kind

@rollandf rollandf requested a review from Copilot June 5, 2025 09:23
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements taint-aware node filtering and enhances OFED deployment by handling GPU tolerations and adding new predicates for Node taint changes.

  • Introduces a new NodeTaintChangedPredicate for filtering Node events based on taint changes.
  • Adds GPU toleration support in state_ofed.go via the addGPUToleration function and updates manifest rendering to use processed tolerations.
  • Expands test coverage for node filter behavior and controller responses to taint/toleration changes.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/state/state_ofed_test.go Updates tests to validate GPU toleration and node filtering logic
pkg/state/state_ofed.go Adds GPU toleration helper, adjusts manifest object generation
pkg/nodeinfo/filter_test.go Introduces comprehensive tests for composite node filtering
pkg/nodeinfo/filter.go Implements NodeTaintFilter and composite filters
controllers/predicate.go Adds NodeTaintChangedPredicate for Node event filtering
controllers/nicclusterpolicy_controller_test.go Enhances tests for taint and toleration interactions in NCP
controllers/nicclusterpolicy_controller.go Updates controller setup to include new Node predicates
Comments suppressed due to low confidence (1)

pkg/nodeinfo/filter.go:175

  • The unschedulable taint key is hard-coded. Consider defining it as a constant to improve readability and maintainability.
if taint.Key == "node.kubernetes.io/unschedulable" && taint.Effect == corev1.TaintEffectNoSchedule {


if len(objs) == 0 {
// GetManifestObjects returned no objects, this means that no objects need to be applied to the cluster
// as (most likely) no Mellanox hardware is found (No mellanox labels where found).
Copy link
Preview

Copilot AI Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a duplicated empty check for objs after waitForStaleObjectsRemoval. Please add a clarifying comment to explain the positioning of this check and confirm that this ordering is intentional.

Suggested change
// as (most likely) no Mellanox hardware is found (No mellanox labels where found).
// as (most likely) no Mellanox hardware is found (No mellanox labels where found).
// This check is intentionally placed after waitForStaleObjectsRemoval to ensure that we handle cases
// where no objects are returned but stale objects might still exist and need to be removed.

Copilot uses AI. Check for mistakes.

Key: "nvidia.com/gpu",
Effect: v1.TaintEffectNoSchedule,
Operator: v1.TolerationOpExists, // Operator can be Exists if Value is not specified, or Equal if Value is specified.
// Value: "true", // Often GPU taints have a value, but Exists is fine if the taint is just by key and effect.
Copy link
Preview

Copilot AI Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] When comparing tolerations in addGPUToleration, consider also comparing the Value field if GPU taints might include a value. This will ensure that the toleration check remains robust in all scenarios.

Suggested change
// Value: "true", // Often GPU taints have a value, but Exists is fine if the taint is just by key and effect.
Value: "", // Default to an empty value; update if specific taints require a value.

Copilot uses AI. Check for mistakes.

Copy link
Member

@rollandf rollandf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

@dahalperin dahalperin force-pushed the fix_ofed_not_ready branch 2 times, most recently from 823cffc to 1a9c1a8 Compare June 9, 2025 09:18
rollandf
rollandf previously approved these changes Jun 9, 2025
Copy link
Collaborator

@e0ne e0ne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once @heyvister1 comment will be addressed

@almaslennikov
Copy link
Collaborator

@dahalperin you need to sign-off your commits to pass the DCO check

Introduces taint-aware node filtering and OFED deployment.

- Controller now watches Node taint changes.
- New `NodeTaintChangedPredicate` and composite node filter
  (labels & taints) for OFED node selection.
- `state_ofed` uses this filter and adds a default `nvidia.com/gpu`
  toleration to OFED DaemonSets.
- NCP tolerations are respected.
- Extensive new tests for controller, filtering, and OFED state.
Signed-off-by: Dana Halperin <[email protected]>
@dahalperin dahalperin force-pushed the fix_ofed_not_ready branch from 75bab8f to 988f748 Compare June 11, 2025 12:15
@dahalperin
Copy link
Contributor Author

/retest-nic_operator_helm

Copy link
Collaborator

@almaslennikov almaslennikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@almaslennikov
Copy link
Collaborator

/retest-nic_operator_helm

@rollandf
Copy link
Member

/retest-nic_operator_helm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants