-
Notifications
You must be signed in to change notification settings - Fork 145
[YUNIKORN-3103] Add comprehensive e2e replication test suite for YuniKorn scheduler #978
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[YUNIKORN-3103] Add comprehensive e2e replication test suite for YuniKorn scheduler #978
Conversation
- Add 13 comprehensive test cases covering various replication scenarios - Test ReplicaSets, Deployments, StatefulSets, multi-container pods, node affinity, rolling updates, priority classes, mixed workloads, and stress testing - Validate YuniKorn scheduler integration and resource allocation - Include performance optimizations for faster CI/CD execution - Add DeleteService method to k8s utils for test cleanup - Optimize timeouts and resource requests for 60% faster execution - Use parallel pod waiting and single lightweight image for efficiency
- Replace len() assertions with proper gomega matchers (ginkgolinter) - Fix formatting issues (gofmt) - Fix variable shadowing issue (govet) - Remove unnecessary embedded field access (staticcheck) - All golangci-lint checks now pass
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #978 +/- ##
==========================================
- Coverage 68.02% 65.90% -2.13%
==========================================
Files 70 72 +2
Lines 9195 9297 +102
==========================================
- Hits 6255 6127 -128
- Misses 2733 2959 +226
- Partials 207 211 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Could you please review the PR ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @rrajesh-cloudera thanks for the PR. My only concern is that the test logic is VERY LONG in the current form and it's difficult to read.
At least let's get rid of the most commonly repeated parts to reduce the size of the tests.
Please look at my suggestions:
- Create helper methods which return a basic replica set, deployment and stateful set objects with all common fields populated.
- If you need changes, just call the method ie.
getReplicaSetSpec()
then modify the fields in the testcase code. - Replace all
gomega.Eventually()
call with appropriate helper calls fromKubeCtl
. These will be new methods.
As a first step, let's reduce the size of the overall change, then I'll go for a second round. |
…improved framework - Fix priority class conflict handling with proper error management for 'already exists' scenarios - Correct priority value assertions to match actual YuniKorn behavior (1000, 100 values) - Extend ReplicaSetConfig to support advanced features: * Init containers for multi-stage pod initialization * Sidecar containers for additional functionality * Node affinity rules for targeted scheduling - Enhance getReplicaSetSpec function to handle complex pod configurations - Add robust resource cleanup with proper error handling for edge cases - Implement comprehensive test coverage for 13 replication scenarios: * Basic replication setup verification * ReplicaSet pod replication and scaling * Deployment scaling operations * Pod failure recovery mechanisms * Resource allocation across replicas * Queue assignment consistency * Multi-container pod replication * Node affinity scheduling * Rolling update replication * StatefulSet replication with persistent identity * Stress testing with multiple replicas * Mixed workload replication (concurrent ReplicaSet/Deployment) * Priority class-based replication - Optimize linter performance and resolve all code quality issues - Add cleanupResource helper function for consistent resource management - Improve test framework with better error handling and validation All tests now pass successfully (13 Passed | 0 Failed) with execution time of ~4 minutes. Closes: Comprehensive e2e replication test suite implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some further comments.
I don't think we need all tests here, especially the "nodeAffinity" and "stress" stuff. These can become complicated very quickly.
I also don't see tuch much value in the following: Verify_Mixed_Workload_Replication
, Verify_Replication_With_Priority_Classes
.
The checks inside Verify_Resource_Allocation_Across_Replicas
and Verify_Queue_Assignment_Consistency
can be merged to
Verify_ReplicaSet_Pod_Replication
.
The rest can stay. By making these changes, code will be further reduced, this making it a smaller maintenance burden.
return nil | ||
} | ||
|
||
// Helper methods for creating basic workload objects with common fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either move these helper methods to replication_test.go
or put them to a separate file eg. k8s_specs.go
(if you intend to re-use them in the follow-up PR).
gomega.Ω(checks).To(gomega.Equal(""), checks) | ||
}) | ||
|
||
ginkgo.It("Verify_Basic_Replication_Setup", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this test at all? This performs a basic validation of a scheduled pod, but this is unrelated to workloads like Deployment, StatefulSet, etc.
If this is useful, it should be moved under basic_scheduling/basic_scheduling_test.go
gomega.Ω(err).NotTo(gomega.HaveOccurred()) | ||
}) | ||
|
||
ginkgo.It("Verify_Replication_Stress_Test", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this - not needed in a functional e2e suite.
for _, pod := range podList.Items { | ||
gomega.Ω(pod.Spec.SchedulerName).To(gomega.Equal("yunikorn"), "Pod should be scheduled by YuniKorn") | ||
gomega.Ω(pod.Status.Phase).To(gomega.Equal(v1.PodRunning), "Pod should be running") | ||
|
||
container := pod.Spec.Containers[0] | ||
cpuReq := container.Resources.Requests[v1.ResourceCPU] | ||
memReq := container.Resources.Requests[v1.ResourceMemory] | ||
gomega.Ω(cpuReq.String()).To(gomega.Equal("50m"), "CPU request should be 50m") | ||
gomega.Ω(memReq.String()).To(gomega.Equal("64Mi"), "Memory request should be 64Mi") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC these checks can be added to an existing test "Verify_ReplicaSet_Pod_Replication".
for _, pod := range podList.Items { | ||
gomega.Ω(pod.Spec.SchedulerName).To(gomega.Equal("yunikorn"), "Pod should be scheduled by YuniKorn") | ||
gomega.Ω(pod.Status.Phase).To(gomega.Equal(v1.PodRunning), "Pod should be running") | ||
|
||
if appID, exists := pod.Labels["applicationId"]; exists { | ||
appIDs = append(appIDs, appID) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC these checks can be added to an existing test "Verify_ReplicaSet_Pod_Replication".
gomega.Ω(err).NotTo(gomega.HaveOccurred()) | ||
}) | ||
|
||
ginkgo.It("Verify_Replication_With_Priority_Classes", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, priority stuff is tested separately elsewhere extensively.
gomega.Ω(err).NotTo(gomega.HaveOccurred()) | ||
}) | ||
|
||
ginkgo.It("Verify_Rolling_Update_Replication", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming: "Verify_Deployment_RollingUpdate"
gomega.Ω(podObj.Spec.SchedulerName).To(gomega.Equal("yunikorn"), "Pod should be scheduled by YuniKorn") | ||
}) | ||
|
||
ginkgo.It("Verify_ReplicaSet_Pod_Replication", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name: "Verify_ReplicaSet_Scheduling"
}) | ||
}) | ||
|
||
ginkgo.It("Verify_Deployment_Scaling_Operations", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name: "Verify_Deployment_Scheduling"
gomega.Ω(err).NotTo(gomega.HaveOccurred()) | ||
}) | ||
|
||
ginkgo.It("Verify_Pod_Failure_Recovery", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name: "Verify_ReplicaSet_PodRestart"
- Rename package from 'replication_test' to 'workload_test' for better clarity - Reduce test count from 13 to 6 focused tests (54% reduction, -449 lines) - Remove complex tests (nodeAffinity, stress, mixed workload, priority classes) - Remove basic pod scheduling test (belongs in basic_scheduling suite) - Merge resource allocation and queue assignment checks into main ReplicaSet test - Rename test functions for consistency: * Verify_Pod_Failure_Recovery → Verify_ReplicaSet_PodRestart * Verify_Deployment_Scaling_Operations → Verify_Deployment_Scheduling * Verify_ReplicaSet_Pod_Replication → Verify_ReplicaSet_Scheduling * Verify_Rolling_Update_Replication → Verify_Deployment_RollingUpdate - Move workload-specific helper functions from k8s_utils.go to test file: * GetBasicDeploymentSpec, GetBasicStatefulSetSpec, GetBasicHeadlessService * Reduces k8s_utils.go bloat and keeps helpers close to usage - Clean up redundant code: * Remove unnecessary 'if len() > 0' checks where length already verified * Fix indentation and remove excessive blank lines * Add robust cleanup error handling - Update suite configuration: * TestReplication → TestWorkload * TEST-replication_junit.xml → TEST-workload_junit.xml * Describe('Replication') → Describe('Workload Test') All 6 remaining tests pass successfully (118s execution time): ✓ Verify_ReplicaSet_Scheduling - ReplicaSet with resource & queue verification ✓ Verify_Deployment_Scheduling - Deployment creation and scaling ✓ Verify_ReplicaSet_PodRestart - Pod failure recovery ✓ Verify_Multi_Container_Pod_Replication - Multi-container pods with init containers ✓ Verify_Deployment_RollingUpdate - Rolling update functionality ✓ Verify_StatefulSet_Replication - StatefulSet with persistent identity Result: Focused, maintainable test suite with comprehensive workload coverage
What is this PR for?
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-3103
How should this be tested?
Screenshots (if appropriate)
Questions: