
Conversation


@rrajesh-cloudera rrajesh-cloudera commented Jul 18, 2025

What is this PR for?

  • Add 13 comprehensive test cases covering various replication scenarios
  • Test ReplicaSets, Deployments, StatefulSets, multi-container pods, node affinity, rolling updates, priority classes, mixed workloads, and stress testing
  • Validate YuniKorn scheduler integration and resource allocation
  • Include performance optimizations for faster CI/CD execution
  • Add DeleteService method to k8s utils for test cleanup
  • Optimize timeouts and resource requests for 60% faster execution
  • Use parallel pod waiting and single lightweight image for efficiency

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

  • Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-3103

How should this be tested?

Screenshots (if appropriate)

Questions:

  • The license files need an update.
  • There are breaking changes for older versions.
  • It needs documentation.

- Replace len() assertions with proper gomega matchers (ginkgolinter)
- Fix formatting issues (gofmt)
- Fix variable shadowing issue (govet)
- Remove unnecessary embedded field access (staticcheck)
- All golangci-lint checks now pass

codecov bot commented Jul 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 65.90%. Comparing base (76aaf7e) to head (805db49).
⚠️ Report is 28 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #978      +/-   ##
==========================================
- Coverage   68.02%   65.90%   -2.13%     
==========================================
  Files          70       72       +2     
  Lines        9195     9297     +102     
==========================================
- Hits         6255     6127     -128     
- Misses       2733     2959     +226     
- Partials      207      211       +4     

@rrajesh-cloudera

Could you please review the PR?
@pbacsko @manirajv06

@rrajesh-cloudera rrajesh-cloudera changed the title Add comprehensive e2e replication test suite for YuniKorn scheduler [YUNIKORN-3103] Add comprehensive e2e replication test suite for YuniKorn scheduler Jul 18, 2025

@pbacsko pbacsko left a comment


Hey @rrajesh-cloudera, thanks for the PR. My only concern is that the test logic is VERY LONG in its current form and it's difficult to read.
At least let's get rid of the most commonly repeated parts to reduce the size of the tests.

Please look at my suggestions:

  1. Create helper methods that return basic ReplicaSet, Deployment, and StatefulSet objects with all common fields populated.
  2. If you need changes, just call the method, e.g. getReplicaSetSpec(), then modify the fields in the test case code.
  3. Replace all gomega.Eventually() calls with appropriate helper calls from KubeCtl. These will be new methods.

@pbacsko

pbacsko commented Aug 5, 2025

As a first step, let's reduce the size of the overall change, then I'll go for a second round.

…improved framework

- Fix priority class conflict handling with proper error management for 'already exists' scenarios
- Correct priority value assertions to match actual YuniKorn behavior (1000, 100 values)
- Extend ReplicaSetConfig to support advanced features:
  * Init containers for multi-stage pod initialization
  * Sidecar containers for additional functionality
  * Node affinity rules for targeted scheduling
- Enhance getReplicaSetSpec function to handle complex pod configurations
- Add robust resource cleanup with proper error handling for edge cases
- Implement comprehensive test coverage for 13 replication scenarios:
  * Basic replication setup verification
  * ReplicaSet pod replication and scaling
  * Deployment scaling operations
  * Pod failure recovery mechanisms
  * Resource allocation across replicas
  * Queue assignment consistency
  * Multi-container pod replication
  * Node affinity scheduling
  * Rolling update replication
  * StatefulSet replication with persistent identity
  * Stress testing with multiple replicas
  * Mixed workload replication (concurrent ReplicaSet/Deployment)
  * Priority class-based replication
- Optimize linter performance and resolve all code quality issues
- Add cleanupResource helper function for consistent resource management
- Improve test framework with better error handling and validation

All tests now pass successfully (13 Passed | 0 Failed) with execution time of ~4 minutes.

Closes: Comprehensive e2e replication test suite implementation

@pbacsko pbacsko left a comment


I added some further comments.

I don't think we need all the tests here, especially the "nodeAffinity" and "stress" stuff. These can become complicated very quickly.
I also don't see much value in the following: Verify_Mixed_Workload_Replication, Verify_Replication_With_Priority_Classes.

The checks inside Verify_Resource_Allocation_Across_Replicas and Verify_Queue_Assignment_Consistency can be merged into
Verify_ReplicaSet_Pod_Replication.

The rest can stay. By making these changes, the code will be further reduced, thus making it a smaller maintenance burden.

	return nil
}

// Helper methods for creating basic workload objects with common fields


Either move these helper methods to replication_test.go or put them into a separate file, e.g. k8s_specs.go (if you intend to re-use them in the follow-up PR).

gomega.Ω(checks).To(gomega.Equal(""), checks)
})

ginkgo.It("Verify_Basic_Replication_Setup", func() {


Do we need this test at all? This performs a basic validation of a scheduled pod, but this is unrelated to workloads like Deployment, StatefulSet, etc.

If this is useful, it should be moved under basic_scheduling/basic_scheduling_test.go

gomega.Ω(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("Verify_Replication_Stress_Test", func() {


You can remove this - not needed in a functional e2e suite.

Comment on lines 458 to 467
for _, pod := range podList.Items {
	gomega.Ω(pod.Spec.SchedulerName).To(gomega.Equal("yunikorn"), "Pod should be scheduled by YuniKorn")
	gomega.Ω(pod.Status.Phase).To(gomega.Equal(v1.PodRunning), "Pod should be running")

	container := pod.Spec.Containers[0]
	cpuReq := container.Resources.Requests[v1.ResourceCPU]
	memReq := container.Resources.Requests[v1.ResourceMemory]
	gomega.Ω(cpuReq.String()).To(gomega.Equal("50m"), "CPU request should be 50m")
	gomega.Ω(memReq.String()).To(gomega.Equal("64Mi"), "Memory request should be 64Mi")
}


IIUC these checks can be added to an existing test "Verify_ReplicaSet_Pod_Replication".

Comment on lines 500 to 506
for _, pod := range podList.Items {
	gomega.Ω(pod.Spec.SchedulerName).To(gomega.Equal("yunikorn"), "Pod should be scheduled by YuniKorn")
	gomega.Ω(pod.Status.Phase).To(gomega.Equal(v1.PodRunning), "Pod should be running")

	if appID, exists := pod.Labels["applicationId"]; exists {
		appIDs = append(appIDs, appID)
	}


IIUC these checks can be added to an existing test "Verify_ReplicaSet_Pod_Replication".

gomega.Ω(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("Verify_Replication_With_Priority_Classes", func() {


Same here: priority handling is already tested extensively elsewhere.

gomega.Ω(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("Verify_Rolling_Update_Replication", func() {


Naming: "Verify_Deployment_RollingUpdate"

gomega.Ω(podObj.Spec.SchedulerName).To(gomega.Equal("yunikorn"), "Pod should be scheduled by YuniKorn")
})

ginkgo.It("Verify_ReplicaSet_Pod_Replication", func() {


Name: "Verify_ReplicaSet_Scheduling"

})
})

ginkgo.It("Verify_Deployment_Scaling_Operations", func() {


Name: "Verify_Deployment_Scheduling"

gomega.Ω(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("Verify_Pod_Failure_Recovery", func() {


Name: "Verify_ReplicaSet_PodRestart"

- Rename package from 'replication_test' to 'workload_test' for better clarity
- Reduce test count from 13 to 6 focused tests (54% reduction, -449 lines)
- Remove complex tests (nodeAffinity, stress, mixed workload, priority classes)
- Remove basic pod scheduling test (belongs in basic_scheduling suite)
- Merge resource allocation and queue assignment checks into main ReplicaSet test
- Rename test functions for consistency:
  * Verify_Pod_Failure_Recovery → Verify_ReplicaSet_PodRestart
  * Verify_Deployment_Scaling_Operations → Verify_Deployment_Scheduling
  * Verify_ReplicaSet_Pod_Replication → Verify_ReplicaSet_Scheduling
  * Verify_Rolling_Update_Replication → Verify_Deployment_RollingUpdate

- Move workload-specific helper functions from k8s_utils.go to test file:
  * GetBasicDeploymentSpec, GetBasicStatefulSetSpec, GetBasicHeadlessService
  * Reduces k8s_utils.go bloat and keeps helpers close to usage

- Clean up redundant code:
  * Remove unnecessary 'if len() > 0' checks where length already verified
  * Fix indentation and remove excessive blank lines
  * Add robust cleanup error handling

- Update suite configuration:
  * TestReplication → TestWorkload
  * TEST-replication_junit.xml → TEST-workload_junit.xml
  * Describe('Replication') → Describe('Workload Test')

All 6 remaining tests pass successfully (118s execution time):
✓ Verify_ReplicaSet_Scheduling - ReplicaSet with resource & queue verification
✓ Verify_Deployment_Scheduling - Deployment creation and scaling
✓ Verify_ReplicaSet_PodRestart - Pod failure recovery
✓ Verify_Multi_Container_Pod_Replication - Multi-container pods with init containers
✓ Verify_Deployment_RollingUpdate - Rolling update functionality
✓ Verify_StatefulSet_Replication - StatefulSet with persistent identity

Result: Focused, maintainable test suite with comprehensive workload coverage