
feat: V1 to CHASM scheduler migration via CreateFromMigrationState RPC#9261

Draft
chaptersix wants to merge 24 commits into temporalio:main from chaptersix:chasm-sch-v1-v2

Conversation


@chaptersix chaptersix commented Feb 9, 2026

What changed

  • AdminService MigrateSchedule RPC signals V1 scheduler workflows to migrate to CHASM (covers both directions; only V1->V2 implemented so far)
  • CHASM scheduler service exposes CreateFromMigrationState RPC to create a scheduler from migrated V1 state
  • V1 scheduler workflow receives a migrate-to-chasm signal, runs a local activity that calls CreateFromMigrationState to create the CHASM schedule
  • CreateSchedulerFromMigration initializes a full CHASM scheduler tree (generator, invoker, backfillers, visibility) from migrated V1 state, preserving the conflict token for client compatibility
  • LegacyToCreateFromMigrationStateRequest converts V1 InternalState + ScheduleInfo into the migration request format, including running/completed workflows as buffered starts and ongoing backfills
  • PendingMigration flag is persisted in InternalState proto so failed migrations automatically retry on the next run loop iteration without requiring a new signal
  • On success, the V1 workflow closes. On failure, it logs, emits schedule_migration_failed metric, and retries next iteration
  • If a CHASM schedule already exists, migration treats it as success (idempotent)
  • Metrics emitted at the workflow level: schedule_migration_started, schedule_migration_completed, schedule_migration_failed

Why

Support migrating from workflow-backed (V1) schedulers to CHASM (V2) schedulers. The admin API (MigrateSchedule) signals the V1 workflow, which snapshots its state and creates the V2 schedule in a single local activity.

Migration activity retry policy

The migration local activity uses a single attempt with 5s timeouts (both start-to-close and schedule-to-close). Rather than retrying in a loop within the local activity, the persistent PendingMigration flag ensures the next run loop iteration re-attempts migration.

Signals during migration

Signals received while the migration local activity is executing are dropped if the migration succeeds (the workflow closes without consuming them).

Follow-up items

  • V2 to V1 migration (rollback path), including CHASM-side pause logic
  • Incoming migration signal for V1 workflow (migrate-from-chasm)
  • Sentinel key handling when EnableCHASMSchedulerCreation was previously enabled
  • Attach completion callbacks to running workflows after migration
  • Alerting on schedule_migration_failed > 0

Add the infrastructure for migrating schedules from the workflow-backed
scheduler (V1) to the CHASM-backed scheduler (V2):

- Add MigrateSchedule RPC to CHASM scheduler service proto
- Add MigrateScheduleRequest/Response messages with migration state
- Implement AdminHandler.MigrateSchedule to signal V1 workflow
- Add migrate signal handler in V1 scheduler workflow
- Add MigrateSchedule activity to call CHASM scheduler service
- Update migration function to accept proto types directly
- Wire up SchedulerServiceClient in worker service fx module

Add handler and logic in chasm/lib/scheduler to create a CHASM
scheduler from migrated V1 state:
- CreateSchedulerFromMigration initializes scheduler with migrated state
- MigrateSchedule handler uses StartExecution with reject duplicate policy
- Tests for migration functionality
…migration test

LegacyToSchedulerMigrationState was returning *SchedulerMigrationState
but the MigrateSchedule activity expects *MigrateScheduleRequest.
Rename to LegacyToMigrateScheduleRequest and return the full request
with NamespaceId populated.

Also fix the migrate signal channel (was incorrectly using
SignalNameForceCAN instead of SignalNameMigrate), add
TestScheduleMigrationV1ToV2 integration test, expose SchedulerClient
from test cluster, and fix staticcheck SA4006 lint errors in
scheduler_test.go.
Add activity-level tests for MigrateSchedule covering success,
already-exists (idempotent), and error paths. Add workflow-level tests
for migrate signal handling: success terminates workflow, failure
continues, and signals are still processed after a failed migration.

Cap migration local activity to 1 attempt with 60s schedule-to-close
timeout instead of inheriting the default 1h with unlimited retries.

Remove unnecessary incSeqNo() before migration -- the conflict token
change is never visible externally since it's in-memory only, and
queued signals are dropped on workflow termination regardless.
@chaptersix chaptersix marked this pull request as ready for review February 11, 2026 15:03
@chaptersix chaptersix requested review from a team as code owners February 11, 2026 15:03
Resolved merge conflict in service/worker/fx.go by including both:
- dummy.Module from upstream
- schedulerpb.NewSchedulerServiceLayeredClient from current branch

Add info log and ScheduleMigrationStarted counter when migration begins
in executeMigration(), and ScheduleMigrationCompleted counter when the
MigrateSchedule activity succeeds (including idempotent already-exists).

Pauses the V1 scheduler before executing the migration activity to
prevent it from starting new workflows during the transition. The
original pause state is restored if migration fails. Also adds retries
to the migration activity (10 attempts with exponential backoff).
@chaptersix chaptersix marked this pull request as draft February 18, 2026 02:52
}

func (a *activities) MigrateSchedule(ctx context.Context, req *schedulerpb.MigrateScheduleRequest) error {
_, err := a.SchedulerClient.MigrateSchedule(ctx, req)
Contributor Author

I'm still apprehensive about resuming (unpausing) the schedule if we get an error here. Let's say on the 10th attempt the CHASM schedule is created, but the success response does not make it back (a TCP connection drops, or something to that effect).

No retries are left.

This activity would be marked as failed, and the V1 scheduler would then resume running as normal.

I'd prefer we leave the V1 schedule paused in the event of failure.

Contributor Author

Regardless, we should have an alert for "migration failed (after all attempts)" > 0.

Contributor Author

After whatever RCA/fix, we can call MigrateSchedule again.

It would also be useful to have a tdbg command that allows us to delete a schedule on a given stack.

Contributor

Let's say on the 10th attempt the CHASM schedule is able to be created but the success response does not make it. TCP connection drops or something to that effect. No retries are left

See my comment on the retries (we shouldn't retry inline here IMO).

This activity would be marked as failed, and the V1 scheduler would then resume running as normal.

That's not possible with this sequencing.

1. Scheduler timer fires (or signal is received) in history service
2. A workflow task is scheduled to be picked up by the worker
3. The workflow task is started by the worker, which triggers the `run` loop
4. The run loop pauses the schedule
5. The local activity runs, creating the CHASM schedule
6. Local activity finishes, and the run loop continues (success case)

In the success case, the local activity completes, the worker adds a MarkerRecorded event, which will be sent as part of the next workflow task completion. Our logic then unpauses the scheduler - this also gets written (as the memo block) when we complete that workflow task. The key bit is that there's no delineation that the history service is aware of between our run loop progressing, and our local activity running; you only get the marker recorded in the event history. While the local activity is running, the SDK may refresh the workflow task (to prevent temporal from thinking it's timed out), but otherwise, history has no idea that the worker's running a local activity.

Now, think about the failure case. Let's say that we pause (locally updating our WF memo block), then create the CHASM schedule in that local activity, then our thread immediately dies, as in your example. At this point, history service doesn't even know that we'd paused the schedule (since that's sent as part of the WFT completion)! Temporal doesn't (can't) just resume execution from where it last left off; all history can do is detect that the workflow task timed out (thread went away), and reschedule it. That rescheduled WFT will cause the scheduler to re-enter the run loop, including any signals that haven't yet been marked as processed (again, by virtue of being included from the SDK as part of a workflow task completion response).

Since the first thing we check in run is always for a pending migration, the rescheduled WFT will go through the exact same migration logic - pause, then attempt migration. Only this time, we'll fast-succeed migration, since we'll detect the CHASM scheduler was already created, and exit. Remember that the V1 scheduler must go sequentially through that run loop, every time, so it's not possible for a V1 scheduler to concurrently be crashed/timed out on migration, and yet still executing the local WFT/progressing further into the run loop.

Regardless, we should have an alert for "migration failed (after all attempts)" > 0.

For sure, we'll probably want to get a (low priority) alert for any migration failures.

It would also be useful to have a tdbg command that allows us to delete a schedule on a give stack.

Do you mean one intelligent enough to figure out the right stack to delete from? We can delete either from tdbg as-is, you'd just have to specify the right archetype (and prefix with temporal-sys-scheduler: for V1, if you want to target it with tdbg workflow commands).

Contributor

Thinking more about this - I don't think we actually need to adjust the pause state for a V1 workflow at all, since:

  • on success, we immediately return from the loop/WFT, so there's no duplicate WF
  • on explicit migration failure, we want to keep running actions
  • in the case that our local activity succeeds at the CHASM create, then crashes, we restart the whole WFT from run; the pause never gets written anyway (nor does it need to be, since we'll still try to process the signal).

That said, you may want to add a flag in the memo block to indicate that the schedule is flagged for migration, so that we can retry in subsequent loops (after we've already received the first migration signal). Like moving pendingMigrate to the State protobuf.

In V2->V1 case, we need to do the pause when rolling back, because CHASM tasks can execute concurrently (so even while you're migrating, the Generator might have a task fire and start trying to queue things up).

Contributor

quick summary: I don't think we have any issue for the "crash" case, which means we also don't need to worry about pausing in V1 (V2 still needs to). We may want to use scheduler's protobuf State for the pendingMigrate flag, so that when we process signals successfully (but fail to migrate, no crash), we can retry in the next run iteration without an additional signal.

@chaptersix
Contributor Author

Attaching callbacks to workflows that have already been started is not yet supported, so I'll do that in another PR.

Add reminders for attaching completion callbacks to migrated running
workflows and handling sentinel keys from EnableCHASMSchedulerCreation.

s.True(workflow.IsContinueAsNewError(s.env.GetWorkflowError()))
}

func (s *workflowSuite) TestMigrateFailureThenSignal() {
Contributor

See comments above re: paused flag; I'm not sure we'll end up needing this test in its current form (but we may want something to assert that the 'pending migration' flag sticks).

- Simplify nil checks in CreateSchedulerFromMigration using protobuf
  nil-chaining
- Set local activity timeouts to 5s with no retries (next run loop
  iteration will re-attempt)
- Fix unchecked SetRootComponent error returns in tests
- Update test timings to match no-retry behavior
PendingMigration is now stored in the InternalState proto instead of
an ephemeral Go field. Failed migrations automatically retry on the
next run loop iteration without requiring a new signal.
CHASM RPC renamed from MigrateSchedule to CreateFromMigrationState.
V1 signal renamed from "migrate" to "migrate-to-chasm".
V1 activity renamed from MigrateSchedule to MigrateScheduleToChasm.
AdminService MigrateSchedule unchanged since it covers both directions.
@chaptersix chaptersix changed the title feat: Add MigrateSchedule RPC for V1 to V2 scheduler migration feat: V1 to CHASM scheduler migration via CreateFromMigrationState RPC Feb 24, 2026
…Schedule

CreateFromMigrationState was returning WorkflowExecutionAlreadyStarted
when a CHASM schedule already existed, while CreateSchedule returns
AlreadyExists. Align the error types for consistency.
…on path

Add newInvokerWithState, newGeneratorWithState, and newBackfillerWithState
private constructors so CreateSchedulerFromMigration reuses the same
initialization logic (including task scheduling) as the normal path.

Also add TODO for CHASM-to-V1 migration support in admin_handler.go.
@chaptersix
Contributor Author

revalidate activity policy

The forceCAN flag was only checked when IterationsBeforeContinueAsNew
was zero, causing the force-continue-as-new signal to be ignored when
a fixed iteration count was configured. Also adds PendingMigration
state assertions to migration failure tests, verifying the flag is
persisted through continue-as-new.
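The forceCAN fix can be sketched as a small predicate; the names and exact structure here are hypothetical, not the PR's actual code:

```go
package main

import "fmt"

// shouldContinueAsNew sketches the corrected check: forceCAN is honored
// regardless of whether a fixed iteration count is configured. Previously
// (per the description above) the flag was only consulted when
// iterationsBeforeCAN was zero, so a configured count masked the signal.
func shouldContinueAsNew(iteration, iterationsBeforeCAN int, forceCAN bool) bool {
	if forceCAN {
		return true
	}
	return iterationsBeforeCAN > 0 && iteration >= iterationsBeforeCAN
}

func main() {
	fmt.Println(shouldContinueAsNew(3, 10, true)) // forced CAN even mid-count
}
```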
// TODO: attach completion callbacks to running workflows migrated from V1.
// TODO: handle sentinel key that may exist if EnableCHASMSchedulerCreation was
// enabled when the schedule was originally created. The existing CHASM component
// must be replaced with the migrated state.
Contributor Author

task to eval next run?

"schedule_payload_size",
WithDescription("The size in bytes of a customer payload (including action results and update signals)"),
)
ScheduleMigrationStarted = NewCounterDef(
Contributor Author

indicate direction


bool need_refresh = 9;

bool pending_migration = 11;
Contributor Author

might need a "paused before migration bool"

@chaptersix
Contributor Author

check pending migration is not set in the internal state that's sent to chasm

Remove unnecessary WithBusinessIDPolicy from CreateFromMigrationState
and add sentinel check to match CreateSchedule's error handling. Fix
functional test to send a well-formed migration request with all
required sub-states populated.