bugfix: cache deletion delay safe call to ExpiredDisruptionDelay #576

Merged
merged 2 commits into from
Aug 24, 2022

Conversation

nathantournant
Member

@nathantournant nathantournant commented Aug 22, 2022

What does this PR do?

  • Adds new functionality
  • Alters existing functionality
  • Fixes a bug
  • Improves documentation or testing

Please briefly describe your changes as well as the motivation behind them:

  • If ExpiredDisruptionGCDelay isn't set, the dynamic targeting observation cache crashes the controller by dereferencing a nil pointer. This PR removes the cache context timeout's dependency on ExpiredDisruptionGCDelay and instead starts an independent goroutine that regularly checks whether the cache still corresponds to an existing disruption, and deletes the cache if it does not.

Code Quality Checklist

  • The documentation is up to date.
  • My code is sufficiently commented and passes continuous integration checks.
  • I have signed my commit (see Contributing Docs).

Testing

  • I leveraged continuous integration testing
    • by depending on existing unit tests or end-to-end tests.
    • by adding new unit tests or end-to-end tests.
  • I manually tested the following steps:
    • Applied the chart with ExpiredGCDelay unset and StaticTargeting set to false, and made sure it didn't crash. Compared with main to confirm that the bug reproduces there.
    • locally.
    • as a canary deployment to a cluster.

Contributor

@clairecng clairecng left a comment

LGTM
Should we add a short line to the documentation saying that, when using dynamic targeting, dynamic targeting for a disruption expires after 2 minutes if expiredDisruptionGCDelay is not set? (If I understood this line correctly.)

cacheCtx, cacheCancelFunc := context.WithTimeout(context.Background(), instance.Spec.Duration.Duration()+*r.ExpiredDisruptionGCDelay*2)
deletionDelay := time.Minute * 2

if r.ExpiredDisruptionGCDelay != nil {
Contributor

@expFlower expFlower Aug 22, 2022

If ExpiredDisruptionGCDelay is not set and it defaults to -1, will this have the desired effect?

In our environment we need to disable the GC of any disruption, see #497 for details.

Member Author

Yes, the value is set to a nil pointer there. That is what caused the segfault.

@@ -511,7 +511,13 @@ func (r *DisruptionReconciler) manageInstanceSelectorCache(instance *chaosv1beta

// start the cache with a cancelable context and duration, and attach it to the controller as a watch source
ch := make(chan error)
cacheCtx, cacheCancelFunc := context.WithTimeout(context.Background(), instance.Spec.Duration.Duration()+*r.ExpiredDisruptionGCDelay*2)
deletionDelay := time.Minute * 2
Contributor

This default value seems risky: a disruption can have no GC expiration delay and last longer than a couple of minutes. In that case, the cache would be cleared for a disruption that still exists.

Member Author

Agreed. It would then recreate another cache with the disruption duration, and so on until the disruption is deleted. This is not a suitable solution.

Contributor

@ptnapoleon ptnapoleon left a comment

lgtm, let's test in staging to confirm the new polling doesn't have too big a perf impact?

@nathantournant nathantournant force-pushed the nathan/bugfix-dynamictargeting branch from b62e21c to 0b93390 Compare August 24, 2022 13:44
@nathantournant nathantournant merged commit ca77fb2 into main Aug 24, 2022
@nathantournant nathantournant deleted the nathan/bugfix-dynamictargeting branch August 24, 2022 14:00
Development

Successfully merging this pull request may close these issues.

User Request: Release Dynamic Targeting behind a feature flag in controller
5 participants