feat: V1 to CHASM scheduler migration via CreateFromMigrationState RPC#9261
chaptersix wants to merge 24 commits into temporalio:main
Conversation
Add the infrastructure for migrating schedules from the workflow-backed scheduler (V1) to the CHASM-backed scheduler (V2):
- Add MigrateSchedule RPC to CHASM scheduler service proto
- Add MigrateScheduleRequest/Response messages with migration state
- Implement AdminHandler.MigrateSchedule to signal V1 workflow
- Add migrate signal handler in V1 scheduler workflow
- Add MigrateSchedule activity to call CHASM scheduler service
- Update migration function to accept proto types directly
- Wire up SchedulerServiceClient in worker service fx module
Add handler and logic in chasm/lib/scheduler to create a CHASM scheduler from migrated V1 state:
- CreateSchedulerFromMigration initializes scheduler with migrated state
- MigrateSchedule handler uses StartExecution with reject duplicate policy
- Tests for migration functionality
…migration test LegacyToSchedulerMigrationState was returning *SchedulerMigrationState but the MigrateSchedule activity expects *MigrateScheduleRequest. Rename to LegacyToMigrateScheduleRequest and return the full request with NamespaceId populated. Also fix the migrate signal channel (was incorrectly using SignalNameForceCAN instead of SignalNameMigrate), add TestScheduleMigrationV1ToV2 integration test, expose SchedulerClient from test cluster, and fix staticcheck SA4006 lint errors in scheduler_test.go.
Add activity-level tests for MigrateSchedule covering success, already-exists (idempotent), and error paths. Add workflow-level tests for migrate signal handling: success terminates workflow, failure continues, and signals are still processed after a failed migration. Cap migration local activity to 1 attempt with 60s schedule-to-close timeout instead of inheriting the default 1h with unlimited retries. Remove unnecessary incSeqNo() before migration -- the conflict token change is never visible externally since it's in-memory only, and queued signals are dropped on workflow termination regardless.
Resolved merge conflict in service/worker/fx.go by including both:
- dummy.Module from upstream
- schedulerpb.NewSchedulerServiceLayeredClient from current branch
Add info log and ScheduleMigrationStarted counter when migration begins in executeMigration(), and ScheduleMigrationCompleted counter when the MigrateSchedule activity succeeds (including idempotent already-exists).
Pauses the V1 scheduler before executing the migration activity to prevent it from starting new workflows during the transition. The original pause state is restored if migration fails. Also adds retries to the migration activity (10 attempts with exponential backoff).
```go
func (a *activities) MigrateSchedule(ctx context.Context, req *schedulerpb.MigrateScheduleRequest) error {
	_, err := a.SchedulerClient.MigrateSchedule(ctx, req)
```
I'm still apprehensive about resuming (unpausing) the schedule if we get an error here.
Let's say on the 10th attempt the CHASM schedule is created, but the success response never makes it back; the TCP connection drops or something to that effect. No retries are left.
This activity would be marked as failed, and the V1 scheduler would then resume running as normal.
I'd prefer we left the V1 schedule paused in the event of failure.
Regardless, we should have an alert for "migration failure (after all attempts)" > 0.
After whatever RCA / fix, we can call the migrate schedule again.
It would also be useful to have a tdbg command that allows us to delete a schedule on a given stack.
> Let's say on the 10th attempt the CHASM schedule is created but the success response does not make it back. TCP connection drops or something to that effect. No retries are left.

See my comment on the retries (we shouldn't retry inline here, IMO).

> This activity would be marked as failed, v1 scheduler would then resume running as normal.

That's not possible with this sequencing:
1. Scheduler timer fires (or signal is received) in history service
2. A workflow task is scheduled to be picked up by the worker
3. The workflow task is started by the worker, which triggers the `run` loop
4. The run loop pauses the schedule
5. The local activity runs, creating the CHASM schedule
6. Local activity finishes, and the run loop continues (success case)
In the success case, the local activity completes, and the worker adds a MarkerRecorded event, which is sent as part of the next workflow task completion. Our logic then unpauses the scheduler; this also gets written (as the memo block) when we complete that workflow task. The key bit is that there's no delineation, from the history service's perspective, between our run loop progressing and our local activity running; you only get the marker recorded in the event history. While the local activity is running, the SDK may refresh the workflow task (to prevent Temporal from thinking it's timed out), but otherwise, history has no idea that the worker is running a local activity.
Now, think about the failure case. Let's say that we pause (locally updating our WF memo block), then create the CHASM schedule in that local activity, then our thread immediately dies, as in your example. At this point, the history service doesn't even know that we'd paused the schedule (since that's sent as part of the WFT completion)! Temporal doesn't (and can't) just resume execution from where it last left off; all history can do is detect that the workflow task timed out (the thread went away) and reschedule it. That rescheduled WFT will cause the scheduler to re-enter the run loop, including any signals that haven't yet been marked as processed (again, by virtue of being included from the SDK as part of a workflow task completion response).
Since the first thing we check in run is always a pending migration, the rescheduled WFT will go through the exact same migration logic: pause, then attempt migration. Only this time, we'll fast-succeed, since we'll detect the CHASM scheduler was already created, and exit. Remember that the V1 scheduler must go sequentially through that run loop every time, so it's not possible for a V1 scheduler to be concurrently crashed/timed out on migration and yet still executing the local WFT/progressing further into the run loop.
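The crash-recovery argument above can be sketched as a plain-Go simulation (all names here are illustrative, not the PR's actual code): migration is attempted at the top of every run iteration, the create call is idempotent, and so a rescheduled workflow task that re-enters the loop after a crash fast-succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("already exists")

// chasmStore stands in for the CHASM service: create is idempotent,
// returning errAlreadyExists on any attempt after the first.
type chasmStore struct{ created map[string]bool }

func (s *chasmStore) create(id string) error {
	if s.created[id] {
		return errAlreadyExists
	}
	s.created[id] = true
	return nil
}

// runIteration mimics one pass of the V1 run loop: a pending migration is
// always checked first, and "already exists" counts as success.
func runIteration(store *chasmStore, id string, pendingMigration bool) (migrated bool) {
	if !pendingMigration {
		return false
	}
	err := store.create(id)
	if err == nil || errors.Is(err, errAlreadyExists) {
		return true // terminate the workflow; the schedule now lives in CHASM
	}
	return false // explicit failure: keep running, retry next iteration
}

func main() {
	store := &chasmStore{created: map[string]bool{}}

	// First WFT: create succeeds, but imagine the worker crashes before the
	// task completion (carrying the pause) is ever reported to history.
	fmt.Println(runIteration(store, "sched-1", true)) // true

	// History reschedules the WFT; the loop re-enters from the top and
	// fast-succeeds because the CHASM schedule already exists.
	fmt.Println(runIteration(store, "sched-1", true)) // true
}
```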
> regardless we should have an alert for "migration failure (after all attempts)" > 0

For sure, we'll probably want a (low-priority) alert for any migration failures.

> It would also be useful to have a tdbg command that allows us to delete a schedule on a given stack.

Do you mean one intelligent enough to figure out the right stack to delete from? We can already delete from tdbg as-is; you'd just have to specify the right archetype (and prefix with temporal-sys-scheduler: for V1, if you want to target it with tdbg workflow commands).
Thinking more about this - I don't think we actually need to adjust the pause state for a V1 workflow at all, since:
- on success, we immediately return from the loop/WFT, so there's no duplicate WF
- on explicit migration failure, we want to keep running actions
- in the case that our local activity succeeds at the CHASM create and then crashes, we restart the whole WFT from run; the pause never gets written anyway (nor does it need to be, since we'll still try to process the signal)

That said, you may want to add a flag in the memo block to indicate that the schedule is flagged for migration, so that we can retry in subsequent loops (after we've already received the first migration signal) - like moving pendingMigrate to the State protobuf.
In the V2->V1 case, we do need to pause when rolling back, because CHASM tasks can execute concurrently (so even while you're migrating, the Generator might have a task fire and start trying to queue things up).
Quick summary: I don't think we have any issue in the "crash" case, which means we also don't need to worry about pausing in V1 (V2 still does). We may want to use the scheduler's protobuf State for the pendingMigrate flag, so that when we process signals successfully but fail to migrate (no crash), we can retry in the next run iteration without an additional signal.
Attaching callbacks to workflows that have already been started is not yet supported, so I'll do that in another PR.
Add reminders for attaching completion callbacks to migrated running workflows and handling sentinel keys from EnableCHASMSchedulerCreation.
```go
	s.True(workflow.IsContinueAsNewError(s.env.GetWorkflowError()))
}

func (s *workflowSuite) TestMigrateFailureThenSignal() {
```
See comments above re: the paused flag; I'm not sure we'll end up needing this test in its current form (but we may want something to assert that the 'pending migration' flag sticks).
- Simplify nil checks in CreateSchedulerFromMigration using protobuf nil-chaining
- Set local activity timeouts to 5s with no retries (next run loop iteration will re-attempt)
- Fix unchecked SetRootComponent error returns in tests
- Update test timings to match no-retry behavior
PendingMigration is now stored in the InternalState proto instead of an ephemeral Go field. Failed migrations automatically retry on the next run loop iteration without requiring a new signal.
CHASM RPC renamed from MigrateSchedule to CreateFromMigrationState. V1 signal renamed from "migrate" to "migrate-to-chasm". V1 activity renamed from MigrateSchedule to MigrateScheduleToChasm. AdminService MigrateSchedule unchanged since it covers both directions.
…Schedule CreateFromMigrationState was returning WorkflowExecutionAlreadyStarted when a CHASM schedule already existed, while CreateSchedule returns AlreadyExists. Align the error types for consistency.
…on path Add newInvokerWithState, newGeneratorWithState, and newBackfillerWithState private constructors so CreateSchedulerFromMigration reuses the same initialization logic (including task scheduling) as the normal path. Also add TODO for CHASM-to-V1 migration support in admin_handler.go.
revalidate activity policy
The forceCAN flag was only checked when IterationsBeforeContinueAsNew was zero, causing the force-continue-as-new signal to be ignored when a fixed iteration count was configured. Also adds PendingMigration state assertions to migration failure tests, verifying the flag is persisted through continue-as-new.
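The fix can be illustrated with a small pure-Go sketch (names here are illustrative, not the PR's actual code): after the fix, the force-continue-as-new flag is consulted regardless of whether a fixed iteration count is configured.

```go
package main

import "fmt"

// shouldContinueAsNew sketches the corrected exit check for the run loop.
// Before the fix, forceCAN was only consulted when IterationsBeforeContinueAsNew
// was zero, so the force signal was silently dropped whenever a fixed
// iteration count was configured.
func shouldContinueAsNew(iterations, iterationsBeforeCAN int, forceCAN bool) bool {
	if forceCAN {
		return true // honored unconditionally after the fix
	}
	return iterationsBeforeCAN > 0 && iterations >= iterationsBeforeCAN
}

func main() {
	// Fixed iteration count configured, force signal received mid-run:
	fmt.Println(shouldContinueAsNew(3, 100, true)) // true (was false before the fix)
	// No force signal, count not yet reached:
	fmt.Println(shouldContinueAsNew(3, 100, false)) // false
	// Count reached:
	fmt.Println(shouldContinueAsNew(100, 100, false)) // true
}
```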
chasm/lib/scheduler/handler.go (outdated)
```go
// TODO: attach completion callbacks to running workflows migrated from V1.
// TODO: handle sentinel key that may exist if EnableCHASMSchedulerCreation was
// enabled when the schedule was originally created. The existing CHASM component
// must be replaced with the migrated state.
```
task to eval next run?
```go
	"schedule_payload_size",
	WithDescription("The size in bytes of a customer payload (including action results and update signals)"),
)
ScheduleMigrationStarted = NewCounterDef(
```
indicate direction
```proto
bool need_refresh = 9;

bool pending_migration = 11;
```
might need a "paused before migration" bool
check pending migration is not set in the internal state that's sent to chasm
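The check above could be addressed by clearing the flag on a copy of the state before it goes into the migration request. A minimal pure-Go sketch, with illustrative types rather than the PR's actual protos:

```go
package main

import "fmt"

// InternalState is a stand-in for the V1 scheduler's state proto.
type InternalState struct {
	NeedRefresh      bool
	PendingMigration bool // V1-only bookkeeping; must not leak to CHASM
}

// buildMigrationState copies the state and strips fields that are
// meaningless on the CHASM side.
func buildMigrationState(s InternalState) InternalState {
	out := s // value copy; the caller's state is untouched
	out.PendingMigration = false
	return out
}

func main() {
	v1 := InternalState{NeedRefresh: true, PendingMigration: true}
	req := buildMigrationState(v1)
	fmt.Println(req.PendingMigration, v1.PendingMigration) // false true
}
```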
Remove unnecessary WithBusinessIDPolicy from CreateFromMigrationState and add sentinel check to match CreateSchedule's error handling. Fix functional test to send a well-formed migration request with all required sub-states populated.
What changed

- `MigrateSchedule` RPC signals V1 scheduler workflows to migrate to CHASM (covers both directions; only V1->V2 implemented so far)
- `CreateFromMigrationState` RPC to create a scheduler from migrated V1 state
- The V1 workflow handles the `migrate-to-chasm` signal and runs a local activity that calls `CreateFromMigrationState` to create the CHASM schedule
- `CreateSchedulerFromMigration` initializes a full CHASM scheduler tree (generator, invoker, backfillers, visibility) from migrated V1 state, preserving the conflict token for client compatibility
- `LegacyToCreateFromMigrationStateRequest` converts V1 `InternalState` + `ScheduleInfo` into the migration request format, including running/completed workflows as buffered starts and ongoing backfills
- The `PendingMigration` flag is persisted in the `InternalState` proto, so failed migrations automatically retry on the next run loop iteration without requiring a new signal
- On failure, the workflow emits the `schedule_migration_failed` metric and retries next iteration
- New metrics: `schedule_migration_started`, `schedule_migration_completed`, `schedule_migration_failed`

Why
Support migrating from workflow-backed (V1) schedulers to CHASM (V2) schedulers. The admin API (`MigrateSchedule`) signals the V1 workflow, which snapshots its state and creates the V2 schedule in a single local activity.

Migration activity retry policy
The migration local activity uses a single attempt with 5s timeouts (both start-to-close and schedule-to-close). Rather than retrying in a loop within the local activity, the persistent `PendingMigration` flag ensures the next run loop iteration re-attempts migration.

Signals during migration

Signals received while the migration local activity is executing are dropped if the migration succeeds (the workflow closes without consuming them).
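The single-attempt retry policy described above maps onto the Go SDK's local activity options roughly as follows. This is a sketch, not the PR's actual code: the function and request type names are hypothetical, and it assumes the standard `go.temporal.io/sdk` API.

```go
import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// runMigrationActivity (hypothetical name) shows the one-attempt, 5s-budget
// configuration; retries happen via the persisted PendingMigration flag on
// the next run loop iteration, not inside the activity.
func runMigrationActivity(ctx workflow.Context, req *MigrateScheduleRequest) error {
	laCtx := workflow.WithLocalActivityOptions(ctx, workflow.LocalActivityOptions{
		ScheduleToCloseTimeout: 5 * time.Second, // total budget across attempts
		StartToCloseTimeout:    5 * time.Second, // budget for the single attempt
		RetryPolicy:            &temporal.RetryPolicy{MaximumAttempts: 1},
	})
	return workflow.ExecuteLocalActivity(laCtx, MigrateScheduleToChasm, req).Get(laCtx, nil)
}
```

Keeping retries out of the activity keeps each workflow task short and lets the loop-level flag drive re-attempts across continue-as-new boundaries.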
Follow-up items

- CHASM-to-V1 migration (`migrate-from-chasm`)
- Attach completion callbacks to migrated workflows that are already running
- Handle the sentinel key that may exist if `EnableCHASMSchedulerCreation` was previously enabled
- Alert on `schedule_migration_failed > 0`