Skip to content

storage: correct source statistics for multi-replica clusters #32676

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

aljoscha
Copy link
Contributor

@aljoscha aljoscha commented Jun 6, 2025

Before, with sources running on multi-replica clusters (which are not
enabled in prod right now), we would overcount some metrics.
Specifically, there a gauge-metrics which would work correctly and there
are counter metrics where we would overcount. For example, each instance
of the source running on each replica would report messages_received.

Now, we treat counter metrics differently: for those metrics where all
sources are reporting basically the same counter update we only take the
updates from the "first" replica. For gauges we just use all of them,
and the assumption is that these are largely the same on all replicas.

The one counter metric that is treated different from the others is
updates_committed: only one replica will manage to commit updates, so
we have to make sure we take this counter in from all replicas, to
capture all messages that have been committed.

Implements https://github.com/MaterializeInc/database-issues/issues/9010

WIP: @def- I still need to add a test that has sources and multi-replica clusters.

@aljoscha aljoscha requested a review from a team as a code owner June 6, 2025 15:34
@aljoscha aljoscha force-pushed the storage-correct-source-stats-for-multi-replica-clusters branch from c8a3ad8 to b50f321 Compare June 6, 2025 16:54
@aljoscha aljoscha requested review from petrosagg and martykulma June 6, 2025 16:54
@aljoscha
Copy link
Contributor Author

aljoscha commented Jun 6, 2025

@martykulma & @petrosagg how do we like this approach?

@aljoscha
Copy link
Contributor Author

aljoscha commented Jun 9, 2025

Updated to also include the same implementation for sinks

aljoscha added 3 commits June 9, 2025 09:59
…date

Before, we were using `SourceStatisticsUpdate` both to collect and
transmit statistics from cluster to the controller _and_ to keep those
statistics in the controller. We want to make changes to how the
controller incorporates statistics updates from the clusters, so we need
to now separate these.

IMO, this in itself is already an improvement because
`SourceStatisticsUpdate` has a lot of functionality that is not needed
in the controller slash can be confusing when read in the context of the
controller.
Before, with sources running on multi-replica clusters (which are not
enabled in prod right now), we would overcount some metrics.
Specifically, there a gauge-metrics which would work correctly and there
are counter metrics where we would overcount. For example, each instance
of the source running on each replica would report `messages_received`.

Now, we treat counter metrics differently: for those metrics where all
sources are reporting basically the same counter update we only take the
updates from the "first" replica. For gauges we just use all of them,
and the assumption is that these are largely the same on all replicas.

The one counter metric that is treated different from the others is
`updates_committed`: only one replica will manage to commit updates, so
we have to make sure we take this counter in from all replicas, to
capture all messages that have been committed.

Implements MaterializeInc/database-issues#9010
Before, with sinks running on multi-replica clusters (which are not
enabled in prod right now), we would overcount some metrics.

Now, we treat counter metrics differently: for those metrics where all
sinks are reporting basically the same counter update we only take the
updates from the "first" replica.

The counter metrics that are treated different from the others are the
`*_committed` metrics: only one replica will manage to commit updates,
so we have to make sure we take this counter in from all replicas, to
capture all messages that have been committed.

Implements MaterializeInc/database-issues#9015
@aljoscha aljoscha force-pushed the storage-correct-source-stats-for-multi-replica-clusters branch from d115035 to 11fef79 Compare June 9, 2025 08:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant