Introduce Celery workers for asynchronous processing of resource management operations, enabling horizontal scalability and improved performance for Poolboy.

Key changes:
- Add Redis as message broker and shared cache backend
- Implement Celery workers for ResourcePool, ResourceHandle, and ResourceClaim processing
- Add partitioned queues with consistent hashing for event ordering
- Implement distributed locking to prevent concurrent resource access
- Create unified cache system shared between operator and workers
- Add Celery Beat scheduler for periodic reconciliation tasks
- Support three operation modes: daemon, scheduler, or both
- Add HPA configuration for worker auto-scaling
- Maintain backward compatibility with synchronous fallback

The implementation uses feature flags to enable workers per resource type, allowing gradual migration and easy rollback. All existing business logic is preserved: workers execute the same class methods that previously ran synchronously in the operator.
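The consistent-hash queue partitioning mentioned above can be sketched in a few lines. The queue prefix and partition count below are hypothetical illustrations, not Poolboy's actual values:

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_queue(namespace: str, name: str, prefix: str = "resource-handle") -> str:
    """Route a resource to a stable queue partition by hashing namespace/name.

    The same key always hashes to the same partition, so all events for
    one resource land on a single queue and are processed in order.
    """
    key = f"{namespace}/{name}".encode()
    partition = int(hashlib.sha256(key).hexdigest(), 16) % NUM_PARTITIONS
    return f"{prefix}-{partition}"
```

Because routing depends only on the key, any number of workers can consume the partitions without reordering events for a given resource.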
The 'operator' directory conflicts with Python's stdlib 'operator' module when Celery imports it. Renaming the entry point to main.py and using the KOPF_OPERATORS env var eliminates the need for the poolboy_worker.py workaround.

Changes:
- Rename operator/operator.py to operator/main.py
- Add KOPF_OPERATORS=main.py to operator deployment
- Simplify worker/scheduler to use direct celery command
- Remove poolboy_worker.py workaround
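The shadowing problem described here is easy to reproduce: any directory named operator that ends up at the front of sys.path wins over the stdlib module. A stdlib-only sketch (the throwaway package contents are made up for the demonstration):

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway package named `operator` and put its parent
# directory at the front of sys.path, mimicking what happens when the
# project's operator/ directory lands on the import path.
with tempfile.TemporaryDirectory() as d:
    pkg = os.path.join(d, "operator")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("SHADOWED = True\n")

    sys.path.insert(0, d)
    sys.modules.pop("operator", None)  # force a fresh import lookup
    importlib.invalidate_caches()
    mod = importlib.import_module("operator")
    shadowed = getattr(mod, "SHADOWED", False)  # True: stdlib is hidden

    # Restore the real stdlib module afterwards
    sys.path.remove(d)
    sys.modules.pop("operator", None)
    importlib.invalidate_caches()
```

Renaming the entry point sidesteps this entirely, since nothing forces the operator directory itself to be importable as a top-level name.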
The celery command requires the operator directory in PYTHONPATH to find the 'tasks' module during autodiscovery.

Changes:
- Add workingDir: /opt/app-root/operator to worker and scheduler
- Add PYTHONPATH=/opt/app-root/operator environment variable
- Use direct celery command instead of shell wrapper
…mode ResourceHandleMatch comparison was preferring handles with known health/ready status over unknown status, causing newer handles to be selected before older ones when the older handles hadn't been processed yet. Changed comparison to only prefer healthy=True over healthy=False (not over None), ensuring creation_timestamp remains the final tiebreaker for FIFO ordering.
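The fixed comparison can be sketched as a sort key. The class and field names here are illustrative, not Poolboy's actual code: only healthy=False is demoted, healthy=None sorts together with healthy=True, and creation_timestamp breaks ties for FIFO ordering.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional

class Handle(NamedTuple):
    name: str
    healthy: Optional[bool]  # None = health not yet determined
    creation_timestamp: datetime

def match_sort_key(handle: Handle):
    # Demote only handles known to be unhealthy (healthy is False).
    # None sorts with True, so unprocessed handles are not skipped,
    # and creation_timestamp remains the final FIFO tiebreaker.
    return (handle.healthy is False, handle.creation_timestamp)

handles = [
    Handle("new", True, datetime(2024, 6, 2, tzinfo=timezone.utc)),
    Handle("old", None, datetime(2024, 6, 1, tzinfo=timezone.utc)),
]
best = min(handles, key=match_sort_key)  # "old" wins on timestamp
```

Under the previous behavior the first key would also have demoted None, so "new" (with known health) would have jumped ahead of the older unprocessed handle.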
Status changes (approval, lifespan, auto-delete/detach conditions) do not trigger Kopf events - they are only detected when the daemon runs its next cycle (every 60 seconds).

Adjustments:
- Increase delay from 1s to 2s and retries from 10-20 to 30-45
- This allows up to 90 seconds for daemon-dependent operations
- Update finalizer expectation to match simplified finalizer format
- Remove 9 workers_resource_* variables from poolboy.py
- Remove workers_*_daemon_mode variables (3)
- Remove operator_mode_distributed variable
- Simplify is_standalone to read directly from IS_STANDALONE env var
- Replace all workers_resource_* checks with "not is_standalone" in main.py
- Daemons remain active for periodic processing

This reduces complexity by moving mode logic to Helm templates. The operator now only needs to check the is_standalone boolean.
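The simplified check might look like the sketch below; the accepted truthy spellings are an assumption, not taken from the source.

```python
import os

def read_is_standalone(environ=os.environ) -> bool:
    """Read standalone mode directly from the IS_STANDALONE env var
    instead of deriving it from per-resource worker flags."""
    return environ.get("IS_STANDALONE", "false").lower() in ("true", "1")

# Distributed processing applies whenever the operator is not standalone.
use_workers = not read_is_standalone()
```

Collapsing nine per-resource flags into one boolean means the operator code no longer encodes deployment topology; Helm templates decide which components run.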
Overview
This PR introduces Celery-based distributed task processing for Poolboy, enabling horizontal scalability and improved performance through asynchronous background processing of resource management operations.
Motivation
Poolboy supports a manager mode architecture where multiple operator replicas distribute event handling across pods. This approach provides good reliability and basic load distribution.
Celery integration complements this architecture by adding asynchronous background processing, partitioned task queues, distributed locking, a shared cache, and worker auto-scaling.
Solution Architecture
The implementation adds three new Kubernetes components alongside the existing Poolboy operator:
Processing Flow
When workers are enabled, the operator dispatches Celery tasks, using the group primitive to batch process multiple resources in parallel. The operator falls back to synchronous processing when workers are disabled, ensuring backward compatibility.
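Celery's group primitive fans one task signature out over many arguments and collects the results. The stdlib sketch below illustrates the same batching pattern without requiring a broker; the task body is a placeholder, not Poolboy's actual task.

```python
from concurrent.futures import ThreadPoolExecutor

def manage_resource(name: str) -> str:
    # Placeholder for the per-resource task body; with Celery this
    # would be a @app.task function invoked via group(...).
    return f"processed {name}"

resources = ["handle-a", "handle-b", "handle-c"]

# Fan out over all resources in parallel and gather results in order,
# analogous to group(manage_resource.s(r) for r in resources)().get().
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(manage_resource, resources))
```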
Key Features
Feature Flags per Resource Type
Each resource class can be individually enabled for worker processing:
- useWorkers.resourcePool.enabled
- useWorkers.resourceHandle.enabled
- useWorkers.resourceClaim.enabled

This allows gradual migration and rollback per component.
Operation Modes
Three modes control how periodic reconciliation is handled:
- scheduler
- daemon
- both

Partitioned Queues
Tasks are routed to partitioned queues using consistent hashing of namespace/name.

Distributed Locking
Redis-based distributed locks prevent concurrent processing of the same resource across workers:
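The usual pattern for a Redis lock is SET with NX and a TTL, plus an owner-checked release. The sketch below uses a tiny in-memory stand-in for Redis so it runs anywhere; it illustrates the general pattern, not Poolboy's actual lock code, and all names are made up.

```python
import time
import uuid

class FakeRedis:
    """In-memory stand-in for the two Redis operations the lock
    pattern needs: SET key value NX PX, and owner-checked DELETE."""
    def __init__(self):
        self.store = {}

    def set_nx_px(self, key, value, ttl_ms):
        now = time.monotonic()
        current = self.store.get(key)
        if current is None or current[1] <= now:  # free or expired
            self.store[key] = (value, now + ttl_ms / 1000)
            return True
        return False

    def delete_if_owner(self, key, value):
        current = self.store.get(key)
        if current is not None and current[0] == value:
            del self.store[key]
            return True
        return False

def acquire_resource_lock(r, namespace, name, ttl_ms=30_000):
    """Try to take the lock for one resource; returns an owner token
    on success, None if another worker holds it."""
    token = uuid.uuid4().hex
    if r.set_nx_px(f"lock:{namespace}/{name}", token, ttl_ms):
        return token
    return None

def release_resource_lock(r, namespace, name, token):
    # Deleting only when the token matches prevents one worker from
    # releasing a lock that expired and was re-acquired by another.
    return r.delete_if_owner(f"lock:{namespace}/{name}", token)
```

The TTL bounds how long a crashed worker can block a resource; the token check keeps release safe when a lock expires mid-task.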
Shared Redis Cache
A unified cache system enables state sharing between operator and workers:
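A shared cache needs only a key scheme both sides agree on and a serialization both can read. A minimal sketch, with a dict standing in for the Redis backend and a key format that is an assumption for illustration:

```python
import json

class SharedCache:
    """Cache keyed identically by operator and workers. A dict stands
    in for Redis here; with redis-py the backend would be a client and
    put() would typically set a TTL."""
    def __init__(self, backend=None):
        self.backend = backend if backend is not None else {}

    @staticmethod
    def key(kind: str, namespace: str, name: str) -> str:
        return f"cache:{kind}:{namespace}/{name}"

    def put(self, kind, namespace, name, obj) -> None:
        # JSON keeps values readable by every process sharing the cache
        self.backend[self.key(kind, namespace, name)] = json.dumps(obj)

    def get(self, kind, namespace, name):
        raw = self.backend.get(self.key(kind, namespace, name))
        return json.loads(raw) if raw is not None else None
```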
Horizontal Pod Autoscaler
Workers support HPA based on CPU/memory utilization for automatic scaling under load.
Components Migrated
Note: ResourceClaim binding still occurs in the operator to leverage in-memory cache for pool handle discovery. Once bound, subsequent updates are processed by workers.
Configuration
All configuration is managed through Helm values.yaml.

Observability