Skip to content

MQE: eager loading of selectors #11624

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

charleskorn
Copy link
Contributor

@charleskorn charleskorn commented Jun 4, 2025

What this PR does

This PR introduces the ability to eagerly load selectors in MQE.

When combined with LazyQueryable, this means that selectors are evaluated in parallel in the background while the query is evaluated. This is important for using MQE in query-frontends, as this allows each part of a shardable query to be evaluated in parallel, rather than serially.

Which issue(s) this PR fixes or relates to

#10067

Checklist

  • Tests updated.
  • [n/a] Documentation added.
  • [covered by Mimir Query Engine #10067] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • [n/a] about-versioning.md updated with experimental features.

@charleskorn charleskorn marked this pull request as ready for review June 4, 2025 05:41
@charleskorn charleskorn requested a review from a team as a code owner June 4, 2025 05:41
Copy link
Contributor

@56quarters 56quarters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good with a few suggestions. Mostly I'm just struggling to understand the context of the change and when it's going to be required or not required (required for #11417 but not required once querysharding is incorporated into MQE?)

@@ -18,6 +18,12 @@ type EngineOpts struct {
// Should only be used in tests.
Pedantic bool `yaml:"-"`

// Prometheus' engine evaluates all selectors (ie. calls Querier.Select()) before evaluating any part of the query.
// We rely on this behavior in query-frontends when evaluating shardable queries so that all selectors are evaluated in parallel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to understand what parts of this change with the current architecture (non-MQE query-frontend) are required and how it will change the behavior. Is the purpose of this option to make MQE act like the Prometheus engine within the query-frontend? I'm missing which parts are going to be executed in parallel and where (query-frontend? queriers?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prometheus' engine makes all of the Querier.Select calls for a query upfront, before evaluating any part of the query.

MQE makes the Querier.Select call for a selector when SeriesMetadata is called on the selector operator. In practice, this usually means all Querier.Select calls are done upfront before any evaluation occurs, but there are some exceptions (eg. queries that include absent() and absent_over_time()). Crucially, MQE iterates through the returned series set in Selector.SeriesMetadata so that it can return the list of series the selector will return.

In queriers, Mimir provides the engine an ordinary Queryable, so this means that queries issue Select calls in series regardless of which engine is used.

In query-frontends, when running a sharded query or a query with subquery spin-off, Mimir provides a LazyQueryable to the engine, which returns a LazyQuerier. LazyQuerier.Select() starts a background goroutine to evaluate the selector and returns a lazy series set immediately. This lazy series set only blocks on receiving the result when the engine starts iterating through it.

When using Prometheus' engine in query-frontends, this allows the different selectors to be evaluated concurrently, and the query doesn't block waiting for a selector to be evaluated until it's required by the engine. This means that different sharded legs of a query are evaluated in parallel, for example.

However, when using MQE in query-frontends, because Selector.SeriesMetadata iterates through the series set immediately after calling Select, the call blocks immediately and no other Select calls are issued until the current one completes.

So the goal of this PR is to have a way for MQE to make all Select calls for a query before iterating over any series set, so that a query with many sharded legs can still have each leg evaluated in parallel.

Does that answer your question?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thank you.

testutils.RequireEqualResults(t, expr, baselineResult, eagerLoadingResult, false)

// Each Select call through our mocked slow storage takes at least 2s, so if they aren't run in parallel, then the query will take over 4s to run.
require.Less(t, duration, 4*time.Second, "expected selectors to be evaluated in parallel")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a good suggestion for testing this but I really don't want to add a unit test that relies on timing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't love this either - the other option I considered was some logic in slowQuerier to have the two Select calls wait for the other to be running before continuing, but I worried that would be more difficult to understand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be preferable. I'll see if I can come up with something latch-like in a separate branch.

Copy link
Contributor

@56quarters 56quarters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock but I'd like to try using some form of synchronization to avoid relying on timing in tests.

testutils.RequireEqualResults(t, expr, baselineResult, eagerLoadingResult, false)

// Each Select call through our mocked slow storage takes at least 2s, so if they aren't run in parallel, then the query will take over 4s to run.
require.Less(t, duration, 4*time.Second, "expected selectors to be evaluated in parallel")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be preferable. I'll see if I can come up with something latch-like in a separate branch.

@56quarters
Copy link
Contributor

I think that would be preferable. I'll see if I can come up with something latch-like in a separate branch.

I opened #11660 which I think demonstrates how to do this. I confirmed it is testing the right thing by removing the lazy-loading wrapper from the Queryable - the test hangs in that case.

@charleskorn
Copy link
Contributor Author

I think that would be preferable. I'll see if I can come up with something latch-like in a separate branch.

I opened #11660 which I think demonstrates how to do this. I confirmed it is testing the right thing by removing the lazy-loading wrapper from the Queryable - the test hangs in that case.

Nice, thanks for the suggestion. I've incorporated that into e204ce2.

# Conflicts:
#	pkg/streamingpromql/operators/selectors/instant_vector_selector.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants