[sharddistributor][leaderelection] Introduce leader election mechanism #6889

3vilhamster · 2025-05-06T16:24:20Z

What changed?
Introducing new components that run the leader election process inside the shard distributor.

- The elector handles leader election using the leaderStore. It periodically re-runs the election to spread the load across hosts and uses a random pause so that leadership moves between hosts.
- LeaderStore is an abstraction on top of the ETCD election API.
- The processor runs leader-guarded processes. For now it is a placeholder; it will host two processes: shard allocation and cleanup of shard executors based on heartbeats.
- The manager runs an elector and a processor for each namespace. Processor methods are passed to the elector as callbacks. This ensures that the process is started only once the leader is elected and is stopped before the leader resigns.
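To make the callback ordering concrete, here is a minimal sketch of one leadership term. It is not the code from this PR: the `LeaderStore` interface shape, the `Callbacks` struct, the in-memory `memStore`, and `runTerm` are all hypothetical names, and placing the random pause after the resignation is an assumption about how the pause is applied.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// LeaderStore is a hypothetical, simplified stand-in for the leaderStore
// abstraction described above; the real interface may differ.
type LeaderStore interface {
	Campaign(ctx context.Context, host string) error // blocks until elected
	Resign(ctx context.Context) error
}

// Callbacks are the processor methods handed to the elector (assumed shape).
type Callbacks struct {
	OnLeader func(ctx context.Context) error // start leader-guarded work
	OnResign func(ctx context.Context) error // stop it before resigning
}

// memStore is an in-memory LeaderStore used only for this sketch: the single
// token in the channel represents leadership.
type memStore struct {
	token  chan struct{}
	events *[]string
}

func (s *memStore) Campaign(ctx context.Context, host string) error {
	select {
	case s.token <- struct{}{}:
		*s.events = append(*s.events, "elected:"+host)
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (s *memStore) Resign(ctx context.Context) error {
	<-s.token
	*s.events = append(*s.events, "resigned")
	return nil
}

// runTerm campaigns once, runs the guarded work for leaderPeriod, stops it,
// resigns, and then pauses for a random duration below maxPause so that
// another host has a chance to win the next election.
func runTerm(ctx context.Context, store LeaderStore, host string, cb Callbacks, leaderPeriod, maxPause time.Duration) error {
	if err := store.Campaign(ctx, host); err != nil {
		return err
	}
	runErr := cb.OnLeader(ctx)
	if runErr == nil {
		select {
		case <-time.After(leaderPeriod):
		case <-ctx.Done():
		}
		// The guarded work is stopped while leadership is still held.
		runErr = cb.OnResign(ctx)
	}
	if err := store.Resign(ctx); err != nil {
		return err
	}
	if maxPause > 0 {
		time.Sleep(time.Duration(rand.Int63n(int64(maxPause))))
	}
	return runErr
}

// demo runs a single term on one host and returns the recorded events.
func demo() []string {
	var events []string
	store := &memStore{token: make(chan struct{}, 1), events: &events}
	cb := Callbacks{
		OnLeader: func(context.Context) error { events = append(events, "process started"); return nil },
		OnResign: func(context.Context) error { events = append(events, "process stopped"); return nil },
	}
	err := runTerm(context.Background(), store, "host-a", cb, 10*time.Millisecond, 5*time.Millisecond)
	if err != nil {
		events = append(events, "error: "+err.Error())
	}
	return events
}

func main() {
	for _, e := range demo() {
		fmt.Println(e)
	}
}
```

In this sketch the recorded order is always elected, process started, process stopped, resigned, which is the guarantee the callback design is meant to provide.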

Why?
This mechanism will be used for namespace management processes inside the shard distributor.

How did you test it?
Unit tests. Added an option to run tests against ETCD. Integration tests will be added later in a separate suite that includes ETCD.

Potential risks
ETCD dependencies might require handling.
This code is not yet used, and integration with sharddistributorfx and integration tests will be added later.

Release notes

Documentation Changes

Groxx

Dropping some general thoughts from reading through it. I have not checked the tests yet tho.

At a very high level I'd say:

- looks pretty good, packages seem to make sense, and yay etcd in its own package (faster builds of stuff that doesn't depend on it)
- high level behavior makes sense, looks correct, and structure makes sense for that behavior
- couple moderate Qs (some duplicated in inlines):
  - why is leadership always resigned after LeaderPeriod, instead of holding it until lost? seems kinda high thrash...
  - Start/Stop being both "must be done only once, likely via fx" and "done repeatedly, possibly out of order" feels risky/surprising, may be worth splitting so they can't be confused on a type level. Or at least documenting visibly.
  - is it possible for two <-leaderCh == true to occur in a row?

1. etcd is now a plugin; it will be used inside cmd/server and passed into configs.
2. The top-level config of the shard distributor is removed for now. A separate PR will add a separate copy of the config for Cadence that initializes the config passed to sharddistributorfx.
3. Renamed process.Start/Stop to Run/Terminate to avoid a confusing lifecycle. Start/Stop methods are usually linked to fx.Lifecycle; other internal processes should have different names.
4. Removed handling of possible double starts/stops. These methods are passed to fx.Lifecycle and should be called only once; extra calls can point to logic errors.
5. Simplified the onResign part of the election to make it more readable.
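The Run/Terminate rename can also be read as a type-level split, which is roughly what the review asked for. The sketch below is illustrative only: the `Process` and `Service` interfaces and the `processor` and `manager` types are hypothetical, and fx itself is left out so the example stays self-contained.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Process is driven by the elector and may be run and terminated many times,
// once per leadership term.
type Process interface {
	Run(ctx context.Context) error
	Terminate(ctx context.Context) error
}

// Service is a lifecycle-managed component: Start and Stop are each expected
// to be called exactly once (for example from fx.Lifecycle hooks).
type Service interface {
	Start(ctx context.Context) error
	Stop(ctx context.Context) error
}

// processor is a hypothetical Process with a plain, non-atomic guard.
type processor struct{ running bool }

func (p *processor) Run(context.Context) error {
	if p.running {
		return errors.New("processor is already running")
	}
	p.running = true
	return nil
}

func (p *processor) Terminate(context.Context) error {
	if !p.running {
		return errors.New("processor is not running")
	}
	p.running = false
	return nil
}

// manager is a hypothetical Service; it has no Run/Terminate methods.
type manager struct{}

func (m *manager) Start(context.Context) error { return nil }
func (m *manager) Stop(context.Context) error  { return nil }

// Compile-time checks: each type satisfies only its own interface, so passing
// a manager where a Process is required does not compile.
var (
	_ Process = (*processor)(nil)
	_ Service = (*manager)(nil)
)

// leadershipTerms simulates the elector driving a Process through n terms and
// returns the first error encountered.
func leadershipTerms(p Process, n int) error {
	ctx := context.Background()
	for i := 0; i < n; i++ {
		if err := p.Run(ctx); err != nil {
			return err
		}
		if err := p.Terminate(ctx); err != nil {
			return err
		}
	}
	return nil
}

// doubleRunError returns the error produced by calling Run twice in a row.
func doubleRunError() error {
	p := &processor{}
	_ = p.Run(context.Background())
	return p.Run(context.Background())
}

func main() {
	fmt.Println("three terms:", leadershipTerms(&processor{}, 3))
	fmt.Println("double run:", doubleRunError())
}
```

With distinct method names, a component meant for fx.Lifecycle cannot be handed to the elector by mistake, and an out-of-order Run surfaces as an error instead of being silently tolerated.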

3vilhamster · 2025-05-07T18:25:30Z

Is it possible for two <-leaderCh == true to occur in a row?
No. leaderCh is not useful in itself; it mainly exists to make testing possible.

taylanisikdemir · 2025-05-07T21:58:52Z

service/sharddistributor/leader/process/processor.go

+
+// Terminate halts processing for this namespace
+func (p *namespaceProcessor) Terminate(ctx context.Context) error {
+	if !p.running {


Concurrent calls are probably not a concern for this component, but we usually make reads and writes of these guard variables atomic.

3vilhamster · 2025-05-08

This should run as a single instance. If someone tries to parallelize this instance, it should panic.
I can add a comment that this is not thread-safe and is not expected to be.
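For reference, the atomic guard suggested in the review could look roughly like the sketch below. This is not what the PR does (the author chose to document the type as not thread-safe instead); `guardedProcessor`, the error values, and `countSuccessfulRuns` are hypothetical, and `atomic.Bool` assumes Go 1.19 or newer.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

var (
	errAlreadyRunning = errors.New("processor is already running")
	errNotRunning     = errors.New("processor is not running")
)

// guardedProcessor is a hypothetical variant of namespaceProcessor whose
// running flag is an atomic.Bool instead of a plain bool.
type guardedProcessor struct {
	running atomic.Bool
}

// Run flips the flag from false to true; only one concurrent caller can win.
func (p *guardedProcessor) Run() error {
	if !p.running.CompareAndSwap(false, true) {
		return errAlreadyRunning
	}
	return nil
}

// Terminate flips the flag back from true to false.
func (p *guardedProcessor) Terminate() error {
	if !p.running.CompareAndSwap(true, false) {
		return errNotRunning
	}
	return nil
}

// countSuccessfulRuns calls Run from many goroutines at once and reports how
// many of those calls succeeded.
func countSuccessfulRuns(callers int) int {
	var (
		p  guardedProcessor
		wg sync.WaitGroup
		ok atomic.Int32
	)
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if p.Run() == nil {
				ok.Add(1)
			}
		}()
	}
	wg.Wait()
	return int(ok.Load())
}

func main() {
	fmt.Println("successful Run calls:", countSuccessfulRuns(50))
}
```

The compare-and-swap makes the check and the update a single step, so exactly one of the concurrent Run calls succeeds. A plain bool gives the same result only when calls are strictly serialized, which is the assumption the author states above.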

[sharddistributor][leaderelection] Introduce leader election mechanism

bc633ad

3vilhamster requested review from Groxx, Shaddoll, davidporter-id-au, demirkayaender, dkrotx, jakobht, neil-xie, sankari165, shijiesheng and taylanisikdemir as code owners May 6, 2025 16:24

3vilhamster mentioned this pull request May 6, 2025

[sharddistributor] Introduce leader election based on ETCD #6875

Closed

Groxx reviewed May 6, 2025

Groxx reviewed May 6, 2025

taylanisikdemir reviewed May 7, 2025

3vilhamster added 5 commits May 7, 2025 15:13

remove config usage

3395430

add module to dockerfiles

d8c529b

move conrstutor + invoke to just invok

079ee26

fix etcdstore tests

dd30c55

3vilhamster requested review from Groxx and taylanisikdemir May 7, 2025 14:09

3vilhamster added 2 commits May 7, 2025 20:08

revert changes to the go.mods outside of etcdstore

c32a03a

fix go.work

5c93ed2

fix go.mod version requirement

724c982

taylanisikdemir approved these changes May 7, 2025

add delay tag

a4f3300

3vilhamster merged commit 88868c0 into cadence-workflow:master May 8, 2025
23 checks passed

3vilhamster deleted the leaderelection-fx branch May 8, 2025 17:19
