feat: V1 to CHASM scheduler migration via CreateFromMigrationState RPC#9261
chaptersix wants to merge 24 commits into temporalio:main
Conversation
Add the infrastructure for migrating schedules from the workflow-backed scheduler (V1) to the CHASM-backed scheduler (V2):
- Add MigrateSchedule RPC to CHASM scheduler service proto
- Add MigrateScheduleRequest/Response messages with migration state
- Implement AdminHandler.MigrateSchedule to signal V1 workflow
- Add migrate signal handler in V1 scheduler workflow
- Add MigrateSchedule activity to call CHASM scheduler service
- Update migration function to accept proto types directly
- Wire up SchedulerServiceClient in worker service fx module
Add handler and logic in chasm/lib/scheduler to create a CHASM scheduler from migrated V1 state:
- CreateSchedulerFromMigration initializes scheduler with migrated state
- MigrateSchedule handler uses StartExecution with reject duplicate policy
- Tests for migration functionality
…migration test LegacyToSchedulerMigrationState was returning *SchedulerMigrationState but the MigrateSchedule activity expects *MigrateScheduleRequest. Rename to LegacyToMigrateScheduleRequest and return the full request with NamespaceId populated. Also fix the migrate signal channel (was incorrectly using SignalNameForceCAN instead of SignalNameMigrate), add TestScheduleMigrationV1ToV2 integration test, expose SchedulerClient from test cluster, and fix staticcheck SA4006 lint errors in scheduler_test.go.
Add activity-level tests for MigrateSchedule covering success, already-exists (idempotent), and error paths. Add workflow-level tests for migrate signal handling: success terminates workflow, failure continues, and signals are still processed after a failed migration. Cap migration local activity to 1 attempt with 60s schedule-to-close timeout instead of inheriting the default 1h with unlimited retries. Remove unnecessary incSeqNo() before migration -- the conflict token change is never visible externally since it's in-memory only, and queued signals are dropped on workflow termination regardless.
Resolved merge conflict in service/worker/fx.go by including both:
- dummy.Module from upstream
- schedulerpb.NewSchedulerServiceLayeredClient from current branch
Add info log and ScheduleMigrationStarted counter when migration begins in executeMigration(), and ScheduleMigrationCompleted counter when the MigrateSchedule activity succeeds (including idempotent already-exists).
Pauses the V1 scheduler before executing the migration activity to prevent it from starting new workflows during the transition. The original pause state is restored if migration fails. Also adds retries to the migration activity (10 attempts with exponential backoff).
```go
func (a *activities) MigrateSchedule(ctx context.Context, req *schedulerpb.MigrateScheduleRequest) error {
	_, err := a.SchedulerClient.MigrateSchedule(ctx, req)
```
I'm still apprehensive about resuming (unpausing) the schedule if we get an error here.
Let's say on the 10th attempt the CHASM schedule is created, but the success response never makes it back; the TCP connection drops or something to that effect. No retries are left.
This activity would be marked as failed, and the V1 scheduler would then resume running as normal.
I'd prefer we left the V1 schedule paused in the event of failure.
Regardless, we should have an alert for "migration failure (after all attempts)" > 0.
After whatever RCA / fix, we can call the migrate schedule again.
It would also be useful to have a tdbg command that allows us to delete a schedule on a given stack.
> Let's say on the 10th attempt the CHASM schedule is created but the success response does not make it back. TCP connection drops or something to that effect. No retries are left.

See my comment on the retries (we shouldn't retry inline here, IMO).

> This activity would be marked as failed, v1 scheduler would then resume running as normal.

That's not possible with this sequencing:
1. Scheduler timer fires (or signal is received) in history service
2. A workflow task is scheduled to be picked up by the worker
3. The workflow task is started by the worker, which triggers the `run` loop
4. The run loop pauses the schedule
5. The local activity runs, creating the CHASM schedule
6. Local activity finishes, and the run loop continues (success case)
In the success case, the local activity completes, and the worker adds a MarkerRecorded event, which is sent as part of the next workflow task completion. Our logic then unpauses the scheduler; this also gets written (as the memo block) when we complete that workflow task. The key bit is that there's no delineation, from the history service's perspective, between our run loop progressing and our local activity running; you only get the marker recorded in the event history. While the local activity is running, the SDK may refresh the workflow task (to prevent Temporal from thinking it's timed out), but otherwise, history has no idea that the worker is running a local activity.
Now, think about the failure case. Let's say that we pause (locally updating our WF memo block), then create the CHASM schedule in that local activity, then our thread immediately dies, as in your example. At this point, the history service doesn't even know that we'd paused the schedule (since that's sent as part of the WFT completion)! Temporal doesn't (and can't) just resume execution from where it last left off; all history can do is detect that the workflow task timed out (the thread went away) and reschedule it. That rescheduled WFT will cause the scheduler to re-enter the run loop, including any signals that haven't yet been marked as processed (again, by virtue of being included from the SDK as part of a workflow task completion response).
Since the first thing we check in run is always a pending migration, the rescheduled WFT will go through the exact same migration logic: pause, then attempt migration. Only this time, we'll fast-succeed, since we'll detect the CHASM scheduler was already created, and exit. Remember that the V1 scheduler must go sequentially through that run loop every time, so it's not possible for a V1 scheduler to be concurrently crashed/timed out on migration and yet still executing the local WFT/progressing further into the run loop.
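The crash-recovery argument above can be sketched as a plain-Go simulation (all names here are illustrative, not the PR's actual code): migration is attempted at the top of every run iteration, the create call is idempotent, and so a rescheduled workflow task that re-enters the loop after a crash fast-succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("already exists")

// chasmStore stands in for the CHASM service: create is idempotent,
// returning errAlreadyExists on any attempt after the first.
type chasmStore struct{ created map[string]bool }

func (s *chasmStore) create(id string) error {
	if s.created[id] {
		return errAlreadyExists
	}
	s.created[id] = true
	return nil
}

// runIteration mimics one pass of the V1 run loop: a pending migration is
// always checked first, and "already exists" counts as success.
func runIteration(store *chasmStore, id string, pendingMigration bool) (migrated bool) {
	if !pendingMigration {
		return false
	}
	err := store.create(id)
	if err == nil || errors.Is(err, errAlreadyExists) {
		return true // terminate the workflow; the schedule now lives in CHASM
	}
	return false // explicit failure: keep running, retry next iteration
}

func main() {
	store := &chasmStore{created: map[string]bool{}}

	// First WFT: create succeeds, but imagine the worker crashes before the
	// task completion (carrying the pause) is ever reported to history.
	fmt.Println(runIteration(store, "sched-1", true)) // true

	// History reschedules the WFT; the loop re-enters from the top and
	// fast-succeeds because the CHASM schedule already exists.
	fmt.Println(runIteration(store, "sched-1", true)) // true
}
```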
> regardless we should have an alert for "migration failure (after all attempts)" > 0

For sure, we'll probably want a (low-priority) alert for any migration failures.

> It would also be useful to have a tdbg command that allows us to delete a schedule on a given stack.

Do you mean one intelligent enough to figure out the right stack to delete from? We can already delete from tdbg as-is; you'd just have to specify the right archetype (and prefix with temporal-sys-scheduler: for V1, if you want to target it with tdbg workflow commands).
Thinking more about this - I don't think we actually need to adjust the pause state for a V1 workflow at all, since:
- on success, we immediately return from the loop/WFT, so there's no duplicate WF
- on explicit migration failure, we want to keep running actions
- in the case that our local activity succeeds at the CHASM create and then crashes, we restart the whole WFT from run; the pause never gets written anyway (nor does it need to be, since we'll still try to process the signal)

That said, you may want to add a flag in the memo block to indicate that the schedule is flagged for migration, so that we can retry in subsequent loops (after we've already received the first migration signal) - like moving pendingMigrate to the State protobuf.
In the V2->V1 case, we do need to pause when rolling back, because CHASM tasks can execute concurrently (so even while you're migrating, the Generator might have a task fire and start trying to queue things up).
Quick summary: I don't think we have any issue in the "crash" case, which means we also don't need to worry about pausing in V1 (V2 still does). We may want to use the scheduler's protobuf State for the pendingMigrate flag, so that when we process signals successfully but fail to migrate (no crash), we can retry in the next run iteration without an additional signal.
Attaching callbacks to workflows that have already been started is not yet supported, so I'll do that in another PR.
Add reminders for attaching completion callbacks to migrated running workflows and handling sentinel keys from EnableCHASMSchedulerCreation.
```go
	s.True(workflow.IsContinueAsNewError(s.env.GetWorkflowError()))
}

func (s *workflowSuite) TestMigrateFailureThenSignal() {
```
See comments above re: the paused flag; I'm not sure we'll end up needing this test in its current form (but we may want something to assert that the 'pending migration' flag sticks).
- Simplify nil checks in CreateSchedulerFromMigration using protobuf nil-chaining
- Set local activity timeouts to 5s with no retries (next run loop iteration will re-attempt)
- Fix unchecked SetRootComponent error returns in tests
- Update test timings to match no-retry behavior
PendingMigration is now stored in the InternalState proto instead of an ephemeral Go field. Failed migrations automatically retry on the next run loop iteration without requiring a new signal.
CHASM RPC renamed from MigrateSchedule to CreateFromMigrationState. V1 signal renamed from "migrate" to "migrate-to-chasm". V1 activity renamed from MigrateSchedule to MigrateScheduleToChasm. AdminService MigrateSchedule unchanged since it covers both directions.
…Schedule CreateFromMigrationState was returning WorkflowExecutionAlreadyStarted when a CHASM schedule already existed, while CreateSchedule returns AlreadyExists. Align the error types for consistency.
…on path Add newInvokerWithState, newGeneratorWithState, and newBackfillerWithState private constructors so CreateSchedulerFromMigration reuses the same initialization logic (including task scheduling) as the normal path. Also add TODO for CHASM-to-V1 migration support in admin_handler.go.
revalidate activity policy
The forceCAN flag was only checked when IterationsBeforeContinueAsNew was zero, causing the force-continue-as-new signal to be ignored when a fixed iteration count was configured. Also adds PendingMigration state assertions to migration failure tests, verifying the flag is persisted through continue-as-new.
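The fix can be illustrated with a small pure-Go sketch (names here are illustrative, not the PR's actual code): after the fix, the force-continue-as-new flag is consulted regardless of whether a fixed iteration count is configured.

```go
package main

import "fmt"

// shouldContinueAsNew sketches the corrected exit check for the run loop.
// Before the fix, forceCAN was only consulted when IterationsBeforeContinueAsNew
// was zero, so the force signal was silently dropped whenever a fixed
// iteration count was configured.
func shouldContinueAsNew(iterations, iterationsBeforeCAN int, forceCAN bool) bool {
	if forceCAN {
		return true // honored unconditionally after the fix
	}
	return iterationsBeforeCAN > 0 && iterations >= iterationsBeforeCAN
}

func main() {
	// Fixed iteration count configured, force signal received mid-run:
	fmt.Println(shouldContinueAsNew(3, 100, true)) // true (was false before the fix)
	// No force signal, count not yet reached:
	fmt.Println(shouldContinueAsNew(3, 100, false)) // false
	// Count reached:
	fmt.Println(shouldContinueAsNew(100, 100, false)) // true
}
```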
chasm/lib/scheduler/handler.go (outdated)
```go
// TODO: attach completion callbacks to running workflows migrated from V1.
// TODO: handle sentinel key that may exist if EnableCHASMSchedulerCreation was
// enabled when the schedule was originally created. The existing CHASM component
// must be replaced with the migrated state.
```
task to eval next run?
```go
	"schedule_payload_size",
	WithDescription("The size in bytes of a customer payload (including action results and update signals)"),
)
ScheduleMigrationStarted = NewCounterDef(
```
indicate direction
```proto
bool need_refresh = 9;

bool pending_migration = 11;
```
might need a "paused before migration" bool
check pending migration is not set in the internal state that's sent to chasm
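The check above could be addressed by clearing the flag on a copy of the state before it goes into the migration request. A minimal pure-Go sketch, with illustrative types rather than the PR's actual protos:

```go
package main

import "fmt"

// InternalState is a stand-in for the V1 scheduler's state proto.
type InternalState struct {
	NeedRefresh      bool
	PendingMigration bool // V1-only bookkeeping; must not leak to CHASM
}

// buildMigrationState copies the state and strips fields that are
// meaningless on the CHASM side.
func buildMigrationState(s InternalState) InternalState {
	out := s // value copy; the caller's state is untouched
	out.PendingMigration = false
	return out
}

func main() {
	v1 := InternalState{NeedRefresh: true, PendingMigration: true}
	req := buildMigrationState(v1)
	fmt.Println(req.PendingMigration, v1.PendingMigration) // false true
}
```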
Remove unnecessary WithBusinessIDPolicy from CreateFromMigrationState and add sentinel check to match CreateSchedule's error handling. Fix functional test to send a well-formed migration request with all required sub-states populated.
What changed

- `MigrateSchedule` RPC signals V1 scheduler workflows to migrate to CHASM (covers both directions; only V1->V2 implemented so far)
- `CreateFromMigrationState` RPC to create a scheduler from migrated V1 state
- The V1 workflow handles the `migrate-to-chasm` signal and runs a local activity that calls `CreateFromMigrationState` to create the CHASM schedule
- `CreateSchedulerFromMigration` initializes a full CHASM scheduler tree (generator, invoker, backfillers, visibility) from migrated V1 state, preserving the conflict token for client compatibility
- `LegacyToCreateFromMigrationStateRequest` converts V1 `InternalState` + `ScheduleInfo` into the migration request format, including running/completed workflows as buffered starts and ongoing backfills
- The `PendingMigration` flag is persisted in the `InternalState` proto, so failed migrations automatically retry on the next run loop iteration without requiring a new signal
- On failure, the workflow emits the `schedule_migration_failed` metric and retries next iteration
- New metrics: `schedule_migration_started`, `schedule_migration_completed`, `schedule_migration_failed`

Why
Support migrating from workflow-backed (V1) schedulers to CHASM (V2) schedulers. The admin API (`MigrateSchedule`) signals the V1 workflow, which snapshots its state and creates the V2 schedule in a single local activity.

Migration activity retry policy
The migration local activity uses a single attempt with 5s timeouts (both start-to-close and schedule-to-close). Rather than retrying in a loop within the local activity, the persistent `PendingMigration` flag ensures the next run loop iteration re-attempts migration.

Signals during migration

Signals received while the migration local activity is executing are dropped if the migration succeeds (the workflow closes without consuming them).
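The single-attempt retry policy described above maps onto the Go SDK's local activity options roughly as follows. This is a sketch, not the PR's actual code: the function and request type names are hypothetical, and it assumes the standard `go.temporal.io/sdk` API.

```go
import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// runMigrationActivity (hypothetical name) shows the one-attempt, 5s-budget
// configuration; retries happen via the persisted PendingMigration flag on
// the next run loop iteration, not inside the activity.
func runMigrationActivity(ctx workflow.Context, req *MigrateScheduleRequest) error {
	laCtx := workflow.WithLocalActivityOptions(ctx, workflow.LocalActivityOptions{
		ScheduleToCloseTimeout: 5 * time.Second, // total budget across attempts
		StartToCloseTimeout:    5 * time.Second, // budget for the single attempt
		RetryPolicy:            &temporal.RetryPolicy{MaximumAttempts: 1},
	})
	return workflow.ExecuteLocalActivity(laCtx, MigrateScheduleToChasm, req).Get(laCtx, nil)
}
```

Keeping retries out of the activity keeps each workflow task short and lets the loop-level flag drive re-attempts across continue-as-new boundaries.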
Follow-up items

- CHASM-to-V1 migration (`migrate-from-chasm`)
- Attach completion callbacks to migrated workflows that are already running
- Handle the sentinel key that may exist if `EnableCHASMSchedulerCreation` was previously enabled
- Alert on `schedule_migration_failed > 0`