Replace policy and action limiters with a checkin limiter #3255

michel-laterman · 2024-02-05T23:56:24Z

What is the problem this PR solves?

Scale tests for multiple policy changes are failing. A contributing factor is the policy limiter, which increases the time it takes for policies to be dispatched (and therefore how long the policy mutex lock is held).

How does this PR solve the problem?

Replace the separate policy and action limiters with a unified limiter in the checkinT struct. The limiter is used only if a response action (which includes policy change actions generated by the policy monitor) is detected in the checkin response and gzip is enabled.

This means that the policyMonitor will dispatch pending policies much faster and release the lock sooner, so policies can be updated and new subscriptions can be processed. Checkin responses are still rate limited so that we can reuse our gzip pool.

Note that the action_limit settings will be used and the policy_throttle settings are ignored.

Design Checklist

I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation pr here
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Relates to Policy monitor load issues #3254

michel-laterman · 2024-02-06T15:29:21Z

buildkite run perf-tests

michel-laterman · 2024-02-06T18:23:36Z

Serverless perf tests have failed because ongoing AZ issues are forcing containers to run out of memory (OOM).
The ECS perf test run is here: https://buildkite.com/elastic/observability-perf/builds/2367#018d7f6d-39a4-4480-90f7-09c8e788ab95 and it looks to have succeeded.

michel-laterman · 2024-02-06T20:40:06Z

Another perf-test attempt: https://buildkite.com/elastic/observability-perf/builds/2368#018d8025-f107-4791-8f7a-e50d10ea513d

michel-laterman · 2024-02-07T00:12:54Z

latest ecs perf-test: https://buildkite.com/elastic/observability-perf/builds/2370#018d8079-17f9-4085-afd5-29bc28fe0433
looks like it's succeeding

juliaElastic · 2024-02-07T08:50:26Z

internal/pkg/api/handleCheckin.go

+	} else if cfg.Limits.ActionLimit.Interval == 0 && cfg.Limits.PolicyThrottle == 0 {
+		rt = rate.Inf
+	}
+	zerolog.Ctx(context.TODO()).Debug().Any("event_rate", rt).Int("burst", cfg.Limits.ActionLimit.Burst).Msg("checkin response gzip limiter")


nit: shouldn't there be a ctx passed instead of context.TODO()?

Ideally yes, but we would need to make sure that the context is tied to the function instead of checkinT's lifecycle.
We also use context.TODO in a few other places similar to this.

There's an existing issue to address this: #3087

juliaElastic

Code LGTM

juliaElastic · 2024-02-08T12:32:39Z

internal/pkg/api/handleCheckin.go

@@ -99,6 +108,7 @@ func NewCheckinT(
 		gcp:    gcp,
 		ad:     ad,
 		tr:     tr,
+		limit:  rate.NewLimiter(rt, cfg.Limits.ActionLimit.Burst),


There is no ActionLimit.Max setting; does that mean only the Burst is being limited?

Interval and burst configure the rate limiter: interval is the time it takes for one token to be added to the rate limit pool, and burst is the maximum pool size.
The max attribute was used to limit the total number of connections allowed on an endpoint (here we would use the checkin endpoint setting).

michel-laterman · 2024-02-08T16:23:42Z

buildkite test this

michel-laterman · 2024-02-08T16:50:29Z

buildkite run perf-tests

elastic-sonarqube · 2024-02-09T09:24:31Z

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
88.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

michel-laterman added the bug and Team:Fleet labels Feb 5, 2024

Finish up unifying throttles

5cdd609

michel-laterman marked this pull request as ready for review February 6, 2024 19:35

michel-laterman requested a review from a team as a code owner February 6, 2024 19:35

michel-laterman mentioned this pull request Feb 6, 2024

Deprecate the fleet-server policy_throttle, use action_limit instead. elastic/ingest-docs#895

Merged

michel-laterman added the enhancement label and removed the bug label Feb 6, 2024

revert default interval change

469a987

juliaElastic reviewed Feb 7, 2024

michel-laterman and others added 3 commits February 7, 2024 09:02

Merge branch 'main' into unify-limiters

ec839a1

fix merge

2ef1aa6

fix integration tests after merge

0915476

juliaElastic approved these changes Feb 8, 2024

juliaElastic reviewed Feb 8, 2024

michel-laterman and others added 3 commits February 8, 2024 08:33

Merge branch 'main' into unify-limiters

3e3e13d

Increase defaults for action_limit burst size

ad77db9

Merge branch 'main' into unify-limiters

b9f17fb

Merge branch 'main' into unify-limiters

274068d

michel-laterman merged commit c67e65d into elastic:main Feb 9, 2024
8 checks passed

michel-laterman deleted the unify-limiters branch February 9, 2024 13:51

michel-laterman added a commit that referenced this pull request Feb 12, 2024

Revert "Replace policy and action limiters with a checkin limiter (#3255)"

ee661c3

This reverts commit c67e65d.

michel-laterman mentioned this pull request Feb 12, 2024

Revert "Replace policy and action limiters with a checkin limiter" #3274

Merged

michel-laterman added a commit that referenced this pull request Feb 13, 2024

Revert "Replace policy and action limiters with a checkin limiter (#3255)" (#3274)

8dd3652

This reverts commit c67e65d.
