Idea: Dynamic rule group load concurrency #12139

@alexweav

Description

What is the problem you are trying to solve?

Rule group object load concurrency has been hardcoded to 10 since its inception.

We've seen extra-large ruler deployments with tens of thousands of rule groups for a single tenant.
Let's assume that to serve a request, we need to do roughly 15k object store GETs. This can happen during normal syncs, or if the rules cache is disabled or needs refreshing.

Assuming perfect batching across the 10 concurrent workers, each one waits on 1,500 serialized GETs. With a typical p50 latency of 50ms per GET, that's 75,000ms, or 75 seconds, to download all the files.

Which solution do you envision (roughly)?

Prior to loading objects, the ruler knows the full set of object keys. I propose that it dynamically adjust its concurrency based on the key count. For small tenants, the concurrency can go below 10, and for huge tenants it can go above 10 to reduce configuration API latency.

That way, we spend the object store rate limit where it's most needed.
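One possible shape for this, as a rough sketch only: scale the concurrency linearly with the key count and clamp it to a range. The function name `loadConcurrency` and the constants `minConcurrency`, `maxConcurrency`, and `keysPerWorker` are all assumptions made up for illustration; the real bounds would need tuning against object store rate limits.

```go
package main

import "fmt"

// All names and values below are hypothetical, chosen for illustration.
const (
	minConcurrency = 2   // floor for small tenants
	maxConcurrency = 64  // ceiling to protect the object store
	keysPerWorker  = 250 // target number of serialized GETs per worker
)

// loadConcurrency picks a concurrency from the number of object keys
// known before loading starts: one worker per keysPerWorker keys,
// clamped to [minConcurrency, maxConcurrency].
func loadConcurrency(keyCount int) int {
	if keyCount < 0 {
		keyCount = 0
	}
	// Integer ceiling division so a partial batch still gets a worker.
	c := (keyCount + keysPerWorker - 1) / keysPerWorker
	if c < minConcurrency {
		return minConcurrency
	}
	if c > maxConcurrency {
		return maxConcurrency
	}
	return c
}

func main() {
	for _, keys := range []int{100, 2500, 15000, 100000} {
		fmt.Printf("keys=%d concurrency=%d\n", keys, loadConcurrency(keys))
	}
}
```

With these made-up constants, a tenant with 2,500 keys lands on today's value of 10, a 100-key tenant drops to 2, and the 15k-key tenant gets 60 workers, so each worker does 250 serialized GETs (about 12.5 seconds at 50ms each, under the same perfect-batching assumption as above).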

Have you considered any alternatives?

Simply raising the hardcoded value. In large cells with many small tenants, the concurrency arguably doesn't need to be increased, and more overlapping requests can push a cell closer to object store rate limits. So I think simply increasing the concurrency is not enough: in that type of cell, it's dangerous.

Any additional context to share?

No response

How long do you think this would take to be developed?

Small (<= 1 month dev)

What are the documentation dependencies?

No response

Proposer?

No response

Metadata

    Labels

    enhancement: New feature or request
