Description
What is the problem you are trying to solve?
Rule group object load concurrency has been hardcoded to 10 since its inception.
We've seen extra-large ruler deployments with tens of thousands of rule groups for a single tenant.
Let's assume that serving a request requires roughly 15k object store gets. This can happen during normal syncs, or when the rules cache is disabled or needs refreshing.
Assuming perfect batching across the 10 threads, each thread waits on 1500 serialized gets. With a typical p50 latency of 50ms, that's 75000ms, or 75 seconds, to download all the files.
Which solution do you envision (roughly)?
Prior to loading objects, the ruler knows the full set of object keys. I propose that it dynamically adjust its load concurrency based on the key count. For small tenants, the concurrency can go below 10, and for huge tenants it can go above 10 to reduce configuration API latency.
That way, we spend the object store rate limit where it's most needed.
Have you considered any alternatives?
In large cells with many small tenants, the concurrency arguably doesn't need to be increased, and more overlapping requests can push a cell closer to its object store rate limits. So I think simply raising the hardcoded concurrency is not enough: in that type of cell it would be dangerous.
Any additional context to share?
No response
How long do you think this would take to be developed?
Small (<= 1 month dev)
What are the documentation dependencies?
No response
Proposer?
No response