
Conversation

@rrajesh-cloudera commented Jul 30, 2025

  • Create new autoscaler test directory with 9 comprehensive test cases
  • Add basic autoscaling functionality tests
  • Add complex scenarios: preemption, multi-queue competition, gang scheduling
  • Add proper lint compliance with nolint annotations for cleanup operations
  • All tests pass with proper YuniKorn integration and resource management

Test coverage includes:

  • Pod scaling with YuniKorn scheduler
  • Resource utilization metrics for autoscaling
  • Queue-based resource management
  • Rapid scaling operations
  • Preemption during autoscaling
  • Multi-queue autoscaling with resource competition
  • Autoscaling with preemption policies
  • Rapid scaling with preemption cascades
  • Gang scheduling autoscaling with preemption

What is this PR for?

This PR adds a new end-to-end test suite for pod replica scaling under the YuniKorn scheduler, covering basic scaling plus preemption, multi-queue competition, and gang scheduling scenarios.

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

  • Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-3109

- Create new autoscaler test directory with 9 comprehensive test cases
- Add basic autoscaling functionality tests
- Add complex scenarios: preemption, multi-queue competition, gang scheduling
- Optimize test performance: 47% faster execution (3m8s vs 5m54s)
- Reduce resource requests: CPU 200m->50m, Memory 256Mi->64Mi
- Optimize timeouts: 480s->90-150s with faster polling intervals
- Add proper lint compliance with nolint annotations for cleanup operations
- All tests pass with proper YuniKorn integration and resource management

Test coverage includes:
- Pod scaling with YuniKorn scheduler
- Resource utilization metrics for autoscaling
- Queue-based resource management
- Rapid scaling operations
- Preemption during autoscaling
- Multi-queue autoscaling with resource competition
- Autoscaling with preemption policies
- Rapid scaling with preemption cascades
- Gang scheduling autoscaling with preemption
@rrajesh-cloudera changed the title from "Add comprehensive Autoscaler test suite" to "[YUNIKORN-3109] Add comprehensive Autoscaler test suite" on Jul 30, 2025

rrajesh-cloudera commented Jul 31, 2025

@pbacsko @manirajv06 Please review


codecov bot commented Jul 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 65.90%. Comparing base (6703e4b) to head (a355811).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #981      +/-   ##
==========================================
- Coverage   68.13%   65.90%   -2.24%     
==========================================
  Files          72       72              
  Lines        9295     9297       +2     
==========================================
- Hits         6333     6127     -206     
- Misses       2748     2959     +211     
+ Partials      214      211       -3     

☔ View full report in Codecov by Sentry.


pbacsko commented Jul 31, 2025

@rrajesh-cloudera can you check why the tests failed?

- Replace WaitForPodBySelectorRunning with robust Eventually blocks to eliminate race conditions
- Add comprehensive debugging output for pod status tracking
- Wait for exact expected number of running pods instead of relying on flawed helper function
- Improve error handling and verification throughout the test

Fixes the core race condition where the test would pass before all intended pods were actually running.
@rrajesh-cloudera (Contributor, Author) replied:

> @rrajesh-cloudera can you check why the tests failed?

Fixed flaky test


@pbacsko left a comment


Ok, here is the first round. My biggest complaint (similar to #987) is the overuse of gomega.Eventually() with sometimes very specific wait conditions which are hard to understand.
As a first step, let's try to reduce those. Use the simplest possible condition - ideally an existing Wait...() method from KubeCtl.

Second, you can delete all defer cleanup calls.
Third, just use simple variable names and avoid tricky constructs like &[]int32{4}[0].
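
For illustration, a minimal sketch of the suggested style (the package and function name below are hypothetical, not from the PR): take the address of a named variable instead of using the &[]int32{4}[0] one-liner.

```go
package deploymentscaling

import (
	appsv1 "k8s.io/api/apps/v1"
)

// newDeploymentWithReplicas shows the readable alternative: declare the
// replica count as a named variable and pass its address to the *int32 field.
func newDeploymentWithReplicas(count int32) *appsv1.Deployment {
	replicas := count
	return &appsv1.Deployment{
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas, // instead of &[]int32{4}[0]
		},
	}
}
```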

- Rename test directory: autoscaler/ → replica_scaling/
- Update all terminology from 'Autoscaler/Autoscaling' to 'ReplicaScaling'
- Fix misleading naming: tests focus on pod replica scaling, not cluster autoscaling
- Update package names, function names, test names, and documentation
- Fix race condition in WaitForPodBySelectorRunning helper function:
  * Now properly waits for pods to exist before checking running state
  * Uses polling to handle deployment controller timing
- All tests now pass: 9/9 success rate
- Consolidated helper functions and eliminated gomega.Eventually overuse
- Improved variable naming (aggressor → highPrio) for clarity

Fixes test failures in rapid scaling, preemption, and gang scheduling scenarios.
- Add WaitForPodCountBySelectorRunning helper function to wait for exact pod counts
- Fix rapid scaling test to properly wait for scale down to exactly 2 pods
- Improve timing reliability for CI environments
- Address assertion failure where expected 2 pods but got 3 during scale down
- All 9 replica scaling tests now pass consistently (100% success rate)

This resolves CI failures in the rapid scaling operations test.

@pbacsko left a comment


I came up with a different name: "Deployment Scaling" tests (so "deployment_scaling" or "Deployment_Scaling", depending on where it occurs).
The tests could have slightly more straightforward names, but let's get back to that in a later round.

Looking at the test scenarios, I believe the following two are worth keeping:

  • Verify_Pod_Scaling_With_YuniKorn_Scheduler --> scale 2->3 but also 3->2 (you can add an extra step after 2->3 to verify the new pod is indeed in the proper queue, etc.)
  • Verify_Preemption_During_ReplicaScaling_Operations

The rest of the tests are very complicated and difficult to understand. If they start to fail at any point during development, it might take hours or days to figure out what goes wrong.

I definitely appreciate the time you put into writing them, but we need to look at these from different angles and the potential maintenance burden is significant.


// WaitForPodCondition waits for a custom pod condition to be met
func (k *KubeCtl) WaitForPodCondition(namespace string, condition PodConditionFunc, timeout time.Duration) error {
return wait.PollUntilContextTimeout(context.TODO(), time.Millisecond*100, timeout, false, func(ctx context.Context) (bool, error) {

Use time.Second instead of 100ms; polling every 100ms is too aggressive for the API server.
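
A minimal sketch of the requested change, with a one-second polling interval; KubeCtl and PodConditionFunc below are placeholders for the e2e framework's real types, which may differ.

```go
package k8s

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Placeholder declarations so the sketch stands alone; the real KubeCtl and
// PodConditionFunc live in the e2e test framework.
type KubeCtl struct{}
type PodConditionFunc func(ctx context.Context) (bool, error)

// WaitForPodCondition polls the condition once per second (instead of every
// 100ms) until it returns true or the timeout elapses.
func (k *KubeCtl) WaitForPodCondition(namespace string, condition PodConditionFunc, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(context.TODO(), time.Second, timeout, false,
		func(ctx context.Context) (bool, error) {
			return condition(ctx)
		})
}
```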


// Wait for exactly the specified number of pods with given selector to be running
func (k *KubeCtl) WaitForPodCountBySelectorRunning(namespace string, selector string, expectedCount int, timeout time.Duration) error {
return wait.PollUntilContextTimeout(context.TODO(), time.Second*2, timeout, false, func(ctx context.Context) (bool, error) {

Nit: just use time.Second for the interval.
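
For context, a sketch of the full helper with the one-second interval; the clientSet field name and the counting logic are assumptions based on the quoted snippet, not the merged code.

```go
package k8s

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// KubeCtl is a stand-in for the e2e framework's helper type.
type KubeCtl struct {
	clientSet kubernetes.Interface
}

// WaitForPodCountBySelectorRunning polls once per second until exactly
// expectedCount pods matching the selector are Running, or the timeout elapses.
func (k *KubeCtl) WaitForPodCountBySelectorRunning(namespace, selector string, expectedCount int, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(context.TODO(), time.Second, timeout, false,
		func(ctx context.Context) (bool, error) {
			pods, err := k.clientSet.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
			if err != nil {
				return false, err
			}
			running := 0
			for _, pod := range pods.Items {
				if pod.Status.Phase == v1.PodRunning {
					running++
				}
			}
			return running == expectedCount, nil
		})
}
```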

},
},
RestartPolicy: v1.RestartPolicyAlways,
TerminationGracePeriodSeconds: func() *int64 { grace := int64(5); return &grace }(),

Just do the following:

gracePeriod := int64(5)
...
TerminationGracePeriodSeconds: &gracePeriod 

Much more readable.
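
A minimal sketch of how that reads in a pod spec (the function name is illustrative); the named variable has to be an int64 because TerminationGracePeriodSeconds is a *int64 field.

```go
package deploymentscaling

import (
	v1 "k8s.io/api/core/v1"
)

// podSpecWithShortGracePeriod shows the readable form of setting the
// *int64 TerminationGracePeriodSeconds field via a named variable.
func podSpecWithShortGracePeriod() v1.PodSpec {
	gracePeriod := int64(5)
	return v1.PodSpec{
		RestartPolicy:                 v1.RestartPolicyAlways,
		TerminationGracePeriodSeconds: &gracePeriod,
	}
}
```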

})

// Test replica scaling with gang scheduling and preemption
It("Verify_Gang_Scheduling_ReplicaScaling_With_Preemption", func() {

I suggest this test be dropped. It looks too complicated and difficult to maintain. There's also an ambiguous condition, gangRunning > 0.

})

// Test resource utilization tracking for replica scaling decisions
It("Verify_Resource_Utilization_Metrics_For_ReplicaScaling", func() {

This test doesn't really test any scaling, just basic pod scheduling. Can be dropped.

…intainability

- Renamed replica_scaling -> deployment_scaling for better terminology
- Simplified from 7+ complex tests to 2 core maintainable scenarios:
  * Verify_Pod_Scaling_With_YuniKorn_Scheduler (2->3->2 scaling with queue verification)
  * Verify_Preemption_During_Deployment_Scaling_Operations (priority-based preemption)
- Reduced API server load: polling interval 100ms -> 1s
- Enhanced code quality:
  * Consistent *v1.Pod pointer usage
  * Readable variable declarations (gracePeriod := int64(5))
  * Fixed linter warnings and improved type handling
- Made tests more robust with flexible YuniKorn allocation count checks
- All tests pass successfully with improved performance and maintainability
…aitForPodCountBySelectorRunning

This addresses the review comment about the API polling interval in
WaitForPodCountBySelectorRunning, changing it from time.Second*2 to
time.Second for consistency.
- Merge variable declaration with assignment (S1021)
- Changes: var cond wait.ConditionFunc; cond = func() to cond := wait.ConditionFunc(func()
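
A small sketch of that staticcheck S1021 fix (the condition body is a placeholder):

```go
package k8s

import (
	"k8s.io/apimachinery/pkg/util/wait"
)

// Before (flagged by staticcheck S1021: declaration immediately followed by
// assignment):
//
//	var cond wait.ConditionFunc
//	cond = func() (bool, error) { return true, nil }
//
// After: merge the declaration and assignment.
func exampleCondition() wait.ConditionFunc {
	cond := wait.ConditionFunc(func() (bool, error) {
		return true, nil // placeholder condition
	})
	return cond
}
```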