Sync flush time #300
Conversation
Force-pushed from 403a844 to 3cbf5ac
Force-pushed from f550b92 to 865562c
Force-pushed from 5e8b0ce to f14994a
Force-pushed from f14994a to 2a1d7d6
go.mod (Outdated)

```diff
@@ -105,7 +105,7 @@ require (
 )

 // Using a fork of the Alertmanager with Alerting Squad specific changes.
-replace github.com/prometheus/alertmanager => github.com/grafana/prometheus-alertmanager v0.25.1-0.20250305143719-fa9fa7096626
+replace github.com/prometheus/alertmanager => github.com/grafana/prometheus-alertmanager v0.25.1-0.20250306181800-10368d39a559
```
should be updated once the AM PR is merged
go.sum (Outdated)

```diff
@@ -194,8 +194,8 @@ github.com/google/uuid v1.5.0 h1:1p67kYwdtXjb0gL0BPiP1Av9wiZPo5A8z2cWkTZ+eyU=
 github.com/google/uuid v1.5.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
 github.com/googleapis/gax-go/v2 v2.0.4/go.mod h1:0Wqv26UfaUD9n4G6kQubkQ+KchISgw+vpHVxEJEs9eg=
 github.com/googleapis/gax-go/v2 v2.0.5/go.mod h1:DWXyrwAJ9X0FpwwEdw+IPEYBICEFu5mhpdKc/us6bOk=
-github.com/grafana/prometheus-alertmanager v0.25.1-0.20250305143719-fa9fa7096626 h1:QsMYtDseSPq8hXvoNtA64unFiawJaE5kryizcMsVZWg=
-github.com/grafana/prometheus-alertmanager v0.25.1-0.20250305143719-fa9fa7096626/go.mod h1:FGdGvhI40Dq+CTQaSzK9evuve774cgOUdGfVO04OXkw=
+github.com/grafana/prometheus-alertmanager v0.25.1-0.20250306181800-10368d39a559 h1:E7az+c68g6E5X0mt8ZRm9+wZ530EWSQUTHgVZ1AKBRQ=
```
likewise
```go
sync:         true,
entries:      []*nflogpb.Entry{},
pipelineTime: now,
expectedErr:  false,
```
WDYT about actually including the expected error here, or at least a string we can check against?
yeah, I got a bit lazy on these tests tbh 🙈
added proper errors now and comparing against those 👍
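As a minimal sketch of the pattern agreed on here: carry the expected sentinel error in the test case and assert with `require.ErrorIs` instead of a bare `expectedErr` bool. The `errSentinel` value and the `exec` closure are placeholders; in the real test the case would drive `SyncFlushStage.Exec` and compare against errors such as `ErrMissingGroupInterval`.

```go
package stages

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

// errSentinel stands in for a real sentinel error defined by the stage.
var errSentinel = errors.New("missing group interval")

func TestExpectedErrorPattern(t *testing.T) {
	cases := []struct {
		name        string
		exec        func() error // stands in for running the stage with this case's inputs
		expectedErr error
	}{
		{
			name:        "missing group interval",
			exec:        func() error { return errSentinel },
			expectedErr: errSentinel,
		},
		{
			name:        "happy path",
			exec:        func() error { return nil },
			expectedErr: nil,
		},
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			err := tc.exec()
			if tc.expectedErr != nil {
				// Compare against the sentinel error rather than a bool flag.
				require.ErrorIs(t, err, tc.expectedErr)
				return
			}
			require.NoError(t, err)
		})
	}
}
```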
```go
if sfs.sync {
	select {
	case <-time.After(wait):
```
I wonder if we should update the context with a new "Now" based on the wait.
That's a good point, maybe we should... I'm not sure what the impact on the other stages would be, but if we're simulating a flush delay, I guess it would be the right thing.
cc @yuri-tceretian since it would affect the extra dedup specifically
No, the timeNow must be the time when the snapshot of the aggrGroup was taken. This step runs after that and does not take a new snapshot.
I think that after it wakes up it needs to run some kind of "coordination step", but without the warning log, and proceed only if there wasn't any state from the future. Otherwise, it should immediately exit the pipeline and start waiting for group_interval.
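For illustration, a minimal Go sketch of the kind of post-wake "coordination step" being suggested. Everything here is an assumption for the sake of the example: `syncFlushStage`, `nflogQuerier`, and the `PipelineTime` field are stand-ins for the PR's `SyncFlushStage` and for the nflog entry that grafana/prometheus-alertmanager#101 extends with a pipeline time; none of these names are the actual code.

```go
package stages

import (
	"context"
	"time"
)

// entry mirrors only the part of the nflog entry this sketch needs.
type entry struct {
	// PipelineTime is the pipeline time recorded by the previous flush
	// (hypothetical field name).
	PipelineTime time.Time
}

// nflogQuerier stands in for nflog.Log.Query filtered by group key and receiver.
type nflogQuerier interface {
	Query(gkey string) ([]*entry, error)
}

type syncFlushStage struct {
	nflog            nflogQuerier
	lastPipelineTime time.Time // snapshot time this pipeline run is based on
}

// waitAndRecheck waits out the computed delay, then re-queries the notification
// log. If another instance recorded a newer pipeline time while we were waiting
// ("state from the future"), it reports that this pipeline run should exit and
// let the aggregation group fire again after group_interval.
func (s *syncFlushStage) waitAndRecheck(ctx context.Context, gkey string, wait time.Duration) (proceed bool, err error) {
	select {
	case <-time.After(wait):
	case <-ctx.Done():
		return false, ctx.Err()
	}

	entries, err := s.nflog.Query(gkey)
	if err != nil {
		return false, err
	}
	for _, e := range entries {
		if e.PipelineTime.After(s.lastPipelineTime) {
			return false, nil // someone else already flushed; skip notifying
		}
	}
	return true, nil
}
```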
Co-authored-by: William Wernert <[email protected]>
```go
	return ctx, nil, ErrMissingGroupInterval
}

entries, err := sfs.nflog.Query(nflog.QGroupKey(gkey), nflog.QReceiver(sfs.recv))
```
I think this will break when a user adds a new integration to the receiver: basically, it will delay the old integrations while not delaying the new one.
```diff
@@ -892,6 +901,9 @@ func (am *GrafanaAlertmanager) createReceiverStage(name string, integrations []*
 	Idx: uint32(integrations[i].Index()),
 }
 var s notify.MultiStage
+if stage := stages.NewSyncFlushStage(notificationLog, recv, syncAct, syncMargin); stage != nil {
```
I think this stage should run before the Fanout stage, perhaps right after meshStage or right after inhibitionStage. This will guarantee that the pipeline is delayed correctly.
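A rough Go sketch of the ordering being suggested: apply the sync-flush delay once for the whole receiver, between the mesh/inhibition stages and the per-integration fanout, rather than inside each integration's MultiStage. The function name and parameters are illustrative assumptions, not the PR's actual createReceiverStage code.

```go
package notifier

import (
	"github.com/prometheus/alertmanager/notify"
)

// buildReceiverPipeline sketches the suggested stage order for one receiver.
func buildReceiverPipeline(meshStage, inhibitionStage, syncFlush notify.Stage, fanout notify.FanoutStage) notify.Stage {
	pipeline := notify.MultiStage{meshStage, inhibitionStage}
	if syncFlush != nil {
		// Delaying here means every integration of the receiver flushes at
		// (roughly) the same synchronized time.
		pipeline = append(pipeline, syncFlush)
	}
	return append(pipeline, fanout)
}
```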
Context
The idea is to have all AM instances flush alerts at the same (or a similar) time. This should reduce the number of duplicated notifications and help reduce escalations.
The strategy is to add a new stage to the notification pipeline. In grafana/prometheus-alertmanager#101, we're adding the pipeline time to the notification log entry so that it can be retrieved in the next flush. We then compare the current pipeline time with the expected next flush (the previous pipeline time plus the group wait, allowing for a margin). If the current execution time is close to the expected next flush, we execute immediately; otherwise, we wait for `nextFlush - currentPipelineTime`. A rough sketch of this timing decision follows.
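The sketch below is illustrative only; names such as `computeWait`, `groupWait`, `syncMargin`, and `lastPipelineTime` are assumptions and not the PR's actual identifiers.

```go
package stages

import (
	"context"
	"time"
)

// computeWait decides how long the pipeline should be delayed.
// lastPipelineTime is the pipeline time stored in the notification log entry
// by the previous flush (added in grafana/prometheus-alertmanager#101).
func computeWait(currentPipelineTime, lastPipelineTime time.Time, groupWait, syncMargin time.Duration) time.Duration {
	nextFlush := lastPipelineTime.Add(groupWait)

	// Already within the margin of the expected next flush (or past it):
	// flush immediately.
	if !currentPipelineTime.Before(nextFlush.Add(-syncMargin)) {
		return 0
	}
	// Otherwise, wait out the remainder so all instances flush at a similar
	// time: nextFlush - currentPipelineTime.
	return nextFlush.Sub(currentPipelineTime)
}

// waitForSync blocks until the computed wait elapses or the pipeline context
// is cancelled.
func waitForSync(ctx context.Context, wait time.Duration) error {
	if wait <= 0 {
		return nil
	}
	select {
	case <-time.After(wait):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```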
Depends on
Related PRs