Contributor

@aymericDD aymericDD commented Dec 22, 2025

What does this PR do?

  • Adds new functionality
  • Alters existing functionality
  • Fixes a bug
  • Improves documentation or testing

Please briefly describe your changes as well as the motivation behind them:
Bug Fix: Cloud network disruptions (e.g., AWS S3) injected --hosts arguments only into the first chaos pod. When multiple targets were selected, every subsequent chaos pod was created without any hosts, so the disruption only partially took effect.

Root Cause: The r.Client.Status().Update(ctx, instance) call in createChaosPods (line 709) overwrites the local instance object with the API server's response. Since UpdateHostsOnCloudDisruption modifies instance.Spec.Network.Hosts in memory only (not persisted to etcd), the server's response does not contain those hosts, and they were cleared after the first pod creation.

Solution: Use DeepCopy() before calling Status().Update() to preserve in-memory spec changes. This follows Kubernetes controller-runtime best practices (cluster-api #1259, controller-runtime #2850).
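
For reviewers who want to see the mechanism in isolation, here is a minimal sketch that reproduces the overwrite without a cluster. It is not the code from this PR: Disruption, fakeServer, UpdateStatus and createPods are hypothetical stand-ins, and the fake server only imitates the one behavior described above (a status update writes the server's stored object back into the object passed in). The host ranges are made up.

```go
package main

import "fmt"

// Disruption is a hypothetical, stripped-down stand-in for the real CRD type.
type Disruption struct {
	Hosts  []string // plays the role of instance.Spec.Network.Hosts
	Status string   // plays the role of instance.Status
}

// DeepCopy returns a copy that shares no memory with the receiver.
func (d *Disruption) DeepCopy() *Disruption {
	return &Disruption{
		Hosts:  append([]string(nil), d.Hosts...),
		Status: d.Status,
	}
}

// fakeServer imitates the one behavior that matters here: a status update
// persists only the status, then overwrites the caller's object with the
// stored version (whose spec never received the in-memory hosts).
type fakeServer struct {
	stored Disruption
}

func (s *fakeServer) UpdateStatus(obj *Disruption) {
	s.stored.Status = obj.Status
	*obj = *s.stored.DeepCopy()
}

// createPods records how many hosts each target's chaos pod would receive.
func createPods(s *fakeServer, instance *Disruption, targets []string, useCopy bool) map[string]int {
	hostsPerPod := make(map[string]int, len(targets))
	for _, target := range targets {
		hostsPerPod[target] = len(instance.Hosts)
		instance.Status = "created pod for " + target

		if useCopy {
			// Fix: the server response lands in a throwaway copy.
			s.UpdateStatus(instance.DeepCopy())
		} else {
			// Bug: the server response replaces instance, dropping Hosts.
			s.UpdateStatus(instance)
		}
	}
	return hostsPerPod
}

func main() {
	targets := []string{"pod-a", "pod-b"}
	hosts := []string{"10.0.0.0/8", "52.0.0.0/8"} // made-up ranges

	buggy := createPods(&fakeServer{}, &Disruption{Hosts: hosts}, targets, false)
	fixed := createPods(&fakeServer{}, &Disruption{Hosts: hosts}, targets, true)

	fmt.Println("without copy:", buggy["pod-a"], buggy["pod-b"])
	fmt.Println("with copy:   ", fixed["pod-a"], fixed["pod-b"])
}
```

Without the copy, the first target sees both hosts and the second sees none; with the copy, both targets see both hosts. In a real controller the copy's refreshed status and resourceVersion may still need to be carried back to the original object, which this sketch does not model.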

BTW: It also fixes the flaky test.

Code Quality Checklist

  • The documentation is up to date.
  • My code is sufficiently commented and passes continuous integration checks.
  • I have signed my commit (see Contributing Docs).

Testing

  • I leveraged continuous integration testing
    • by depending on existing unit tests or end-to-end tests.
    • by adding new unit tests or end-to-end tests.
  • I manually tested the following steps:
    • Reproduction: Created a cloud disruption targeting AWS S3 with 2 target replicas. Before the fix, the first chaos pod had 361 --hosts arguments and the second pod had 0.
    • Verification: After the fix, both chaos pods receive all 361 cloud service IP ranges.
    • E2E Test: Ran ginkgo --focus "should create a cloud disruption but apply a host disruption with the list of cloud managed service ip ranges"; it passes.
    • locally.
    • as a canary deployment to a cluster.

Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@aymericDD aymericDD changed the title from "fix(network): preserve hosts in cloud disruptions" to "[CHAOSPLT-1364] Fix cloud disruption hosts only injected in first chaos pod" on Dec 22, 2025
@aymericDD aymericDD force-pushed the aymeric.daurelle/CHAOSPLT-1364/fix branch from 623468b to 68d4ef6 on December 22, 2025 20:06
@aymericDD aymericDD force-pushed the aymeric.daurelle/CHAOSPLT-259/feat branch 2 times, most recently from 766b11f to 28c876d on December 22, 2025 20:27
@aymericDD aymericDD force-pushed the aymeric.daurelle/CHAOSPLT-1364/fix branch from 68d4ef6 to 205f48c on December 22, 2025 20:27
Use DeepCopy before Status().Update() to prevent
in-memory spec changes from being lost. Without this,
cloud disruption hosts were cleared after the first
chaos pod creation, causing subsequent pods to have
no hosts injected.

Jira: CHAOSPLT-1364
@aymericDD aymericDD force-pushed the aymeric.daurelle/CHAOSPLT-259/feat branch from 28c876d to 6f36370 on December 22, 2025 20:42
@aymericDD aymericDD force-pushed the aymeric.daurelle/CHAOSPLT-1364/fix branch from 205f48c to dea3090 on December 22, 2025 20:42
@aymericDD aymericDD marked this pull request as ready for review December 22, 2025 20:43
@aymericDD aymericDD requested a review from a team December 22, 2025 20:43
3 participants