
Conversation


@rrajesh-cloudera rrajesh-cloudera commented Jul 18, 2025

What is this PR for?

  • Add 13 comprehensive test cases covering various replication scenarios
  • Test ReplicaSets, Deployments, StatefulSets, multi-container pods, node affinity, rolling updates, priority classes, mixed workloads, and stress testing
  • Validate YuniKorn scheduler integration and resource allocation
  • Include performance optimizations for faster CI/CD execution
  • Add DeleteService method to k8s utils for test cleanup
  • Optimize timeouts and resource requests for 60% faster execution
  • Use parallel pod waiting and single lightweight image for efficiency

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

  • Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-3103

How should this be tested?

Screenshots (if appropriate)

Questions:

  • The license files need an update.
  • There are breaking changes for older versions.
  • It needs documentation.

- Replace len() assertions with proper gomega matchers (ginkgolinter)
- Fix formatting issues (gofmt)
- Fix variable shadowing issue (govet)
- Remove unnecessary embedded field access (staticcheck)
- All golangci-lint checks now pass

codecov bot commented Jul 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 65.90%. Comparing base (76aaf7e) to head (805db49).
⚠️ Report is 28 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #978      +/-   ##
==========================================
- Coverage   68.02%   65.90%   -2.13%     
==========================================
  Files          70       72       +2     
  Lines        9195     9297     +102     
==========================================
- Hits         6255     6127     -128     
- Misses       2733     2959     +226     
- Partials      207      211       +4     

@rrajesh-cloudera

Could you please review the PR?
@pbacsko @manirajv06

@rrajesh-cloudera rrajesh-cloudera changed the title Add comprehensive e2e replication test suite for YuniKorn scheduler [YUNIKORN-3103] Add comprehensive e2e replication test suite for YuniKorn scheduler Jul 18, 2025

@pbacsko pbacsko left a comment


Hey @rrajesh-cloudera, thanks for the PR. My only concern is that the test logic is VERY LONG in its current form and it's difficult to read.
At least let's get rid of the most commonly repeated parts to reduce the size of the tests.

Please look at my suggestions:

  1. Create helper methods that return basic ReplicaSet, Deployment, and StatefulSet objects with all common fields populated.
  2. If you need changes, just call the method, e.g. getReplicaSetSpec(), then modify the fields in the test case code.
  3. Replace all gomega.Eventually() calls with appropriate helper calls from KubeCtl. These will be new methods.

@pbacsko

pbacsko commented Aug 5, 2025

As a first step, let's reduce the size of the overall change, then I'll go for a second round.

…improved framework

- Fix priority class conflict handling with proper error management for 'already exists' scenarios
- Correct priority value assertions to match actual YuniKorn behavior (1000, 100 values)
- Extend ReplicaSetConfig to support advanced features:
  * Init containers for multi-stage pod initialization
  * Sidecar containers for additional functionality
  * Node affinity rules for targeted scheduling
- Enhance getReplicaSetSpec function to handle complex pod configurations
- Add robust resource cleanup with proper error handling for edge cases
- Implement comprehensive test coverage for 13 replication scenarios:
  * Basic replication setup verification
  * ReplicaSet pod replication and scaling
  * Deployment scaling operations
  * Pod failure recovery mechanisms
  * Resource allocation across replicas
  * Queue assignment consistency
  * Multi-container pod replication
  * Node affinity scheduling
  * Rolling update replication
  * StatefulSet replication with persistent identity
  * Stress testing with multiple replicas
  * Mixed workload replication (concurrent ReplicaSet/Deployment)
  * Priority class-based replication
- Optimize linter performance and resolve all code quality issues
- Add cleanupResource helper function for consistent resource management
- Improve test framework with better error handling and validation

All tests now pass successfully (13 Passed | 0 Failed) with execution time of ~4 minutes.

Closes: Comprehensive e2e replication test suite implementation

@pbacsko pbacsko left a comment


I added some further comments.

I don't think we need all the tests here, especially the "nodeAffinity" and "stress" stuff. These can become complicated very quickly.
I also don't see much value in the following: Verify_Mixed_Workload_Replication, Verify_Replication_With_Priority_Classes.

The checks inside Verify_Resource_Allocation_Across_Replicas and Verify_Queue_Assignment_Consistency can be merged into
Verify_ReplicaSet_Pod_Replication.

The rest can stay. By making these changes, the code will be further reduced, thus making it a smaller maintenance burden.

	return nil
}

// Helper methods for creating basic workload objects with common fields


Either move these helper methods to replication_test.go or put them into a separate file, e.g. k8s_specs.go (if you intend to re-use them in the follow-up PR).

gomega.Ω(checks).To(gomega.Equal(""), checks)
})

ginkgo.It("Verify_Basic_Replication_Setup", func() {


Do we need this test at all? This performs a basic validation of a scheduled pod, but this is unrelated to workloads like Deployment, StatefulSet, etc.

If this is useful, it should be moved under basic_scheduling/basic_scheduling_test.go

gomega.Ω(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("Verify_Replication_Stress_Test", func() {


You can remove this - not needed in a functional e2e suite.

Comment on lines 458 to 467
for _, pod := range podList.Items {
	gomega.Ω(pod.Spec.SchedulerName).To(gomega.Equal("yunikorn"), "Pod should be scheduled by YuniKorn")
	gomega.Ω(pod.Status.Phase).To(gomega.Equal(v1.PodRunning), "Pod should be running")

	container := pod.Spec.Containers[0]
	cpuReq := container.Resources.Requests[v1.ResourceCPU]
	memReq := container.Resources.Requests[v1.ResourceMemory]
	gomega.Ω(cpuReq.String()).To(gomega.Equal("50m"), "CPU request should be 50m")
	gomega.Ω(memReq.String()).To(gomega.Equal("64Mi"), "Memory request should be 64Mi")
}


IIUC these checks can be added to an existing test "Verify_ReplicaSet_Pod_Replication".

Comment on lines 500 to 506
for _, pod := range podList.Items {
	gomega.Ω(pod.Spec.SchedulerName).To(gomega.Equal("yunikorn"), "Pod should be scheduled by YuniKorn")
	gomega.Ω(pod.Status.Phase).To(gomega.Equal(v1.PodRunning), "Pod should be running")

	if appID, exists := pod.Labels["applicationId"]; exists {
		appIDs = append(appIDs, appID)
	}


IIUC these checks can be added to an existing test "Verify_ReplicaSet_Pod_Replication".

gomega.Ω(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("Verify_Replication_With_Priority_Classes", func() {


Same here: priority handling is already tested extensively elsewhere.

gomega.Ω(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("Verify_Rolling_Update_Replication", func() {


Naming: "Verify_Deployment_RollingUpdate"

gomega.Ω(podObj.Spec.SchedulerName).To(gomega.Equal("yunikorn"), "Pod should be scheduled by YuniKorn")
})

ginkgo.It("Verify_ReplicaSet_Pod_Replication", func() {


Name: "Verify_ReplicaSet_Scheduling"

})
})

ginkgo.It("Verify_Deployment_Scaling_Operations", func() {


Name: "Verify_Deployment_Scheduling"

gomega.Ω(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("Verify_Pod_Failure_Recovery", func() {


Name: "Verify_ReplicaSet_PodRestart"

- Rename package from 'replication_test' to 'workload_test' for better clarity
- Reduce test count from 13 to 6 focused tests (54% reduction, -449 lines)
- Remove complex tests (nodeAffinity, stress, mixed workload, priority classes)
- Remove basic pod scheduling test (belongs in basic_scheduling suite)
- Merge resource allocation and queue assignment checks into main ReplicaSet test
- Rename test functions for consistency:
  * Verify_Pod_Failure_Recovery → Verify_ReplicaSet_PodRestart
  * Verify_Deployment_Scaling_Operations → Verify_Deployment_Scheduling
  * Verify_ReplicaSet_Pod_Replication → Verify_ReplicaSet_Scheduling
  * Verify_Rolling_Update_Replication → Verify_Deployment_RollingUpdate

- Move workload-specific helper functions from k8s_utils.go to test file:
  * GetBasicDeploymentSpec, GetBasicStatefulSetSpec, GetBasicHeadlessService
  * Reduces k8s_utils.go bloat and keeps helpers close to usage

- Clean up redundant code:
  * Remove unnecessary 'if len() > 0' checks where length already verified
  * Fix indentation and remove excessive blank lines
  * Add robust cleanup error handling

- Update suite configuration:
  * TestReplication → TestWorkload
  * TEST-replication_junit.xml → TEST-workload_junit.xml
  * Describe('Replication') → Describe('Workload Test')

All 6 remaining tests pass successfully (118s execution time):
✓ Verify_ReplicaSet_Scheduling - ReplicaSet with resource & queue verification
✓ Verify_Deployment_Scheduling - Deployment creation and scaling
✓ Verify_ReplicaSet_PodRestart - Pod failure recovery
✓ Verify_Multi_Container_Pod_Replication - Multi-container pods with init containers
✓ Verify_Deployment_RollingUpdate - Rolling update functionality
✓ Verify_StatefulSet_Replication - StatefulSet with persistent identity

Result: Focused, maintainable test suite with comprehensive workload coverage