feat: implement strict multi-zone pod distribution for StatefulSets (PSCLOUD-64) #701

Draft
abhikumar2204 wants to merge 10 commits into main from pr-pscloud-64

abhikumar2204 (Contributor) commented Feb 13, 2026

Adds comprehensive multi-zone pod distribution to prevent StatefulSet quorum
loss during zone failures in AKS, EKS, and GKE clusters.

Features:

  • Balanced topology spread constraints (maxSkew: 1) for even distribution across zones
  • Host-level anti-affinity to prevent multiple pods on the same node
  • Direct transformer application, compatible with single-zone clusters
  • Dedicated nodepool restriction for stateful workloads
  • Per-service enablement for RabbitMQ and PostgreSQL

Changes:

  • Add multi-zone transformers with balanced constraints
  • Apply transformers directly, replacing the earlier zone-detection approach
  • Add nodepool restriction with agentpool=stateful requirement
  • Update VDM task pipeline to include multi-zone distribution
  • Add configuration variables, all defaulting to true
  • Update documentation with usage examples and behavior explanations

Transformers added:

  • rabbitmq-zone-distribution.yaml (balanced multi-zone)
  • postgres-zone-distribution.yaml (balanced multi-zone)
  • multi-zone-pod-distribution.yaml (general StatefulSets)
  • rabbitmq-single-zone-distribution.yaml (single-zone fallback)
  • postgres-single-zone-distribution.yaml (single-zone fallback)
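
For readers unfamiliar with the transformer format, a balanced zone-spread transformer could look roughly like the sketch below. It is not the content of rabbitmq-zone-distribution.yaml. The transformer name, the target StatefulSet name, and the pod label are hypothetical placeholders. Only maxSkew: 1 and DoNotSchedule come from this description.

```yaml
# Hypothetical sketch of a Kustomize PatchTransformer; not the file added in this PR.
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: example-rabbitmq-zone-distribution    # hypothetical transformer name
patch: |-
  # JSON 6902 patch (written as YAML) that sets the zone spread constraint.
  - op: add
    path: /spec/template/spec/topologySpreadConstraints
    value:
      - maxSkew: 1                             # per-zone pod counts may differ by at most 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule       # hard constraint: pod stays Pending if violated
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: example-rabbitmq   # assumed pod label
target:
  kind: StatefulSet
  name: example-rabbitmq                       # assumed StatefulSet name
```

A single-zone fallback would presumably relax or omit the zone constraint. This description does not say how, so it is not sketched here.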

Configuration variables:

  • V4_CFG_MULTI_ZONE_ENABLED (default: true)
  • V4_CFG_MULTI_ZONE_RABBITMQ_ENABLED (default: true)
  • V4_CFG_MULTI_ZONE_POSTGRES_ENABLED (default: true)
  • V4_CFG_STATEFUL_NODEPOOL_RESTRICTION (default: true)
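
As an illustration, a deployment that wants zone spreading for RabbitMQ only, and has no dedicated stateful node pool, might override the defaults as sketched below. This assumes the variables are set in the deployment's variables file like other V4_CFG_* settings and that they accept plain booleans. Neither detail is confirmed in this description.

```yaml
# Sketch of overrides; all four variables default to true per the list above.
V4_CFG_MULTI_ZONE_ENABLED: true
V4_CFG_MULTI_ZONE_RABBITMQ_ENABLED: true
V4_CFG_MULTI_ZONE_POSTGRES_ENABLED: false      # example: opt PostgreSQL out
V4_CFG_STATEFUL_NODEPOOL_RESTRICTION: false    # example: no agentpool=stateful nodes available
```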

Technical Implementation:

  • Topology spread: maxSkew: 1 with DoNotSchedule for balanced distribution
  • Node affinity: agentpool=stateful label restriction
  • Host distribution: maxSkew: 1 with DoNotSchedule for node-level spreading
  • Graceful degradation: Works in both single-zone and multi-zone clusters
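
Taken together, the three mechanisms above correspond to a pod template fragment along the lines of the sketch below. This is an illustration under stated assumptions, not the patch applied by this PR. The pod label in labelSelector is a hypothetical placeholder. The label key agentpool is taken from the description, and whether nodes on EKS or GKE carry it depends on how the node pools are labelled.

```yaml
# Illustrative fragment of spec.template.spec in a StatefulSet; not the actual patch.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: agentpool                   # label key as named in this description
              operator: In
              values:
                - stateful
topologySpreadConstraints:
  - maxSkew: 1                                 # zone-level balance
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: example-statefulset   # assumed pod label
  - maxSkew: 1                                 # node-level balance
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: example-statefulset   # assumed pod label
```

In a single-zone cluster the zone constraint sees only one topology domain, so its skew is always 0 and it never blocks scheduling. The required node affinity is a hard rule, however: if no node carries the agentpool=stateful label, pods with this fragment remain Pending.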

Validation Results:

  • RabbitMQ (3 replicas) distributed across centralus-1, centralus-2, centralus-3
  • Consul (3 replicas) distributed across centralus-1, centralus-2, centralus-3
  • Redis (2 replicas) distributed across centralus-2, centralus-3
  • PostgreSQL instances distributed across multiple zones

Resolves: Multi-zone StatefulSet quorum protection requirements (PSCLOUD-64)
Supports: AKS, EKS, GKE multi-zone and single-zone deployments
Backward compatible: Works with existing deployments without configuration changes

abhikumar2204 marked this pull request as draft Feb 13, 2026 06:05
github-actions bot added the enhancement (New feature or request) label Feb 13, 2026
abhikumar2204 self-assigned this Feb 13, 2026
abhikumar2204 changed the title from "feat: implement strict multi-zone pod distribution for StatefulSets" to "feat: implement strict multi-zone pod distribution for StatefulSets (PSCLOUD-64)" Feb 13, 2026
Labels

enhancement New feature or request