Contributor

@3vilhamster 3vilhamster commented May 6, 2025

What changed?
Introducing new components that run the leader election process inside the shard distributor.

  1. The elector handles the leader election using the leaderStore. It periodically reschedules to spread the load across hosts and uses a random pause to ensure the leader bounces around hosts.
  2. LeaderStore is an abstraction on top of the ETCD election API.
  3. The processor allows running leader-elected guarded processes. This is a placeholder for now. It will host 2 processes: shard allocations and shard executors cleanup based on heartbeats.
  4. The manager runs an elector and a processor for each namespace. Processor methods are passed to the elector as callbacks. This lets us ensure that the process starts only once leadership is acquired and stops before the leader resigns.
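
A minimal sketch of the callback wiring described in item 4, assuming nothing about the actual implementation: `elector`, `recordingProcess`, `oneTerm`, and `simulateTerm` are hypothetical names, and no ETCD client is involved. It only illustrates the ordering guarantee: the guarded process runs after election and is terminated before leadership is given up.

```go
package main

import "context"

// recordingProcess is a hypothetical leader-guarded process that only
// records when it is started and stopped.
type recordingProcess struct{ events *[]string }

func (p recordingProcess) Run(context.Context) error {
	*p.events = append(*p.events, "run")
	return nil
}

func (p recordingProcess) Terminate(context.Context) error {
	*p.events = append(*p.events, "terminate")
	return nil
}

// elector is a hypothetical stand-in for the real elector. It does not
// talk to ETCD; it only shows where the callbacks are invoked.
type elector struct {
	onLeader func(context.Context) error // called right after winning the election
	onResign func(context.Context) error // called right before giving up leadership
	events   *[]string
}

// oneTerm simulates a single leadership term.
func (e *elector) oneTerm(ctx context.Context) error {
	*e.events = append(*e.events, "elected")
	if err := e.onLeader(ctx); err != nil {
		return err
	}
	// A real elector would hold leadership for some period here.
	if err := e.onResign(ctx); err != nil {
		return err
	}
	*e.events = append(*e.events, "resigned")
	return nil
}

// simulateTerm wires a process into an elector and returns the event order.
func simulateTerm() []string {
	var events []string
	p := recordingProcess{events: &events}
	e := &elector{onLeader: p.Run, onResign: p.Terminate, events: &events}
	if err := e.oneTerm(context.Background()); err != nil {
		events = append(events, "error: "+err.Error())
	}
	return events
}
```

Passing the processor methods as plain function values keeps the elector unaware of what it guards, which is one way to make the ordering testable without a real election backend.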

Why?
This mechanism will be used for namespace management processes inside the shard distributor.

How did you test it?
Unit tests. Added an option to run the tests against ETCD. Integration tests will be added later in a separate suite that includes ETCD.

Potential risks
ETCD dependencies might require handling.
This code is not yet used, and integration with sharddistributorfx and integration tests will be added later.

Release notes

Documentation Changes

Member

@Groxx Groxx left a comment

Dropping some general thoughts from reading through it. I have not checked the tests yet tho.

At a very high level I'd say:

  • looks pretty good, packages seem to make sense, and yay etcd in its own package (faster builds of stuff that doesn't depend on it)
  • high level behavior makes sense, looks correct, and structure makes sense for that behavior
  • couple moderate Qs (some duplicated in inlines):
    • why is leadership always resigned after LeaderPeriod, instead of holding it until lost? seems kinda high thrash...
    • Start/Stop being both "must be done only once, likely via fx" and "done repeatedly, possibly out of order" feels risky/surprising, may be worth splitting so they can't be confused on a type level. Or at least documenting visibly.
    • is it possible for two <-leaderCh == true to occur in a row?

1. etcd is now a plugin and will be used inside cmd/server and passed into configs.
2. Top level config of shard distributor is removed for now. Will work in a separate PR on a separate copy of the config for Cadence that will initialize the config that is passed to sharddistributorfx.
3. Renamed process.Start/Stop to Run/Terminate to avoid confusing lifecycle. Start/Stop methods are usually linked to fx.Lifecycle. Other internal processes should have different names.
4. Removed expectations of possible double starts/stops. These methods are passed to fx.Lifecycle and should be called only once. Extra calls can point to logic misses.
5. Simplified onResign part in election to make it more readable.
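
As a rough illustration of the naming split in item 3, the sketch below keeps the two lifecycles apart at the type level. All names here (`Lifecycle`, `LeaderProcess`, `countingProcess`, `holdTerms`, `lifecycleSplitDemo`) are hypothetical and not taken from the PR; the point is only that a function accepting `LeaderProcess` cannot be handed a type that exposes just Start/Stop.

```go
package main

import "context"

// Lifecycle is for components started and stopped exactly once,
// for example through fx.Lifecycle hooks. (Hypothetical interface.)
type Lifecycle interface {
	Start(ctx context.Context) error
	Stop(ctx context.Context) error
}

// LeaderProcess is for work that is run and terminated repeatedly
// as leadership is gained and lost. (Hypothetical interface.)
type LeaderProcess interface {
	Run(ctx context.Context) error
	Terminate(ctx context.Context) error
}

// countingProcess counts how often it is run and terminated.
type countingProcess struct{ runs, terminations int }

func (p *countingProcess) Run(context.Context) error       { p.runs++; return nil }
func (p *countingProcess) Terminate(context.Context) error { p.terminations++; return nil }

// Compile-time check that countingProcess satisfies LeaderProcess.
var _ LeaderProcess = (*countingProcess)(nil)

// holdTerms simulates n leadership terms. It accepts only a LeaderProcess,
// so a Start/Stop-only component would be rejected by the compiler.
func holdTerms(ctx context.Context, p LeaderProcess, n int) error {
	for i := 0; i < n; i++ {
		if err := p.Run(ctx); err != nil {
			return err
		}
		if err := p.Terminate(ctx); err != nil {
			return err
		}
	}
	return nil
}

// lifecycleSplitDemo runs three simulated terms and reports the counts.
func lifecycleSplitDemo() (runs, terminations int) {
	p := &countingProcess{}
	_ = holdTerms(context.Background(), p, 3)
	return p.runs, p.terminations
}
```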
@3vilhamster
Contributor Author

  • Is it possible for two <-leaderCh == true to occur in a row?
    No. leaderCh is of little use on its own; it exists mainly to make testing possible.


// Terminate halts processing for this namespace
func (p *namespaceProcessor) Terminate(ctx context.Context) error {
	if !p.running {
Member

Concurrent calls are probably not a concern for this component, but we usually make reads/writes of guard variables like this atomic.

Contributor Author

This should be running as a single instance. If someone tries to parallelize this instance, it should panic.
I can add a comment that this is not thread-safe and not expected to be thread-safe.
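
For reference, the atomic guard suggested in this thread could look roughly like the sketch below. This is not the code in the PR: `guardedProcessor` and `guardDemo` are hypothetical, the error messages are made up, and it assumes Go 1.19+ for `atomic.Bool`.

```go
package main

import (
	"context"
	"errors"
	"sync/atomic"
)

// guardedProcessor is a hypothetical processor whose running flag is
// read and written atomically.
type guardedProcessor struct {
	running atomic.Bool
}

// Run flips the flag from false to true; a second Run without an
// intervening Terminate fails instead of racing.
func (p *guardedProcessor) Run(ctx context.Context) error {
	if !p.running.CompareAndSwap(false, true) {
		return errors.New("processor is already running")
	}
	return nil
}

// Terminate flips the flag from true to false; terminating a processor
// that is not running is reported as an error.
func (p *guardedProcessor) Terminate(ctx context.Context) error {
	if !p.running.CompareAndSwap(true, false) {
		return errors.New("processor is not running")
	}
	return nil
}

// guardDemo reports which calls in a Run, Run, Terminate, Terminate
// sequence succeed.
func guardDemo() (firstRun, secondRun, firstTerminate, secondTerminate bool) {
	ctx := context.Background()
	p := &guardedProcessor{}
	firstRun = p.Run(ctx) == nil
	secondRun = p.Run(ctx) == nil
	firstTerminate = p.Terminate(ctx) == nil
	secondTerminate = p.Terminate(ctx) == nil
	return
}
```

The compare-and-swap makes the check and the update a single step, so two concurrent callers cannot both pass the guard. Whether that is worth it for a component meant to run as a single instance is the judgment call discussed above.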

@3vilhamster 3vilhamster merged commit 88868c0 into cadence-workflow:master May 8, 2025
23 checks passed
@3vilhamster 3vilhamster deleted the leaderelection-fx branch May 8, 2025 17:19