
Fix reprotect timeout bug #412

Merged

merged 15 commits into main from usr/lukeatdell/reprotect-bug Jul 3, 2025

Conversation

@lukeatdell (Contributor) commented Jun 26, 2025

Dependent on #415 and dell/gopowerscale#101

Description

Problem

During a reprotect, if the initial sync of a newly created SyncIQ policy took longer than the default timeout of five minutes, the reprotect action's context would time out, the Replication Group status would be set to Error, and the reprotect action would be re-queued.
The retry of the reprotect would then fail indefinitely: the first step of the reprotect flow is to confirm that the "Local Target" policy is write-enabled on the storage system about to become the new source, but that policy no longer exists because the first reprotect attempt deleted it.
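
A minimal sketch of the failure mode in Go; the `PolicyClient` interface and function names here are hypothetical stand-ins for illustration, not the driver's actual code:

```go
package replication

import (
	"context"
	"time"
)

// PolicyClient is a hypothetical stand-in for the SyncIQ client.
type PolicyClient interface {
	// SyncFinished reports whether the initial sync for the named
	// policy has completed.
	SyncFinished(ctx context.Context, policy string) (bool, error)
}

// waitForSyncToFinish mirrors the old behavior: block until the full
// initial sync completes or the (five-minute) context expires.
func waitForSyncToFinish(ctx context.Context, c PolicyClient, policy string) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// A slow initial sync ends up here: the Replication Group
			// is marked Error and the action is re-queued, even though
			// the old "Local Target" policy was already deleted.
			return ctx.Err()
		case <-ticker.C:
			done, err := c.SyncFinished(ctx, policy)
			if err != nil {
				return err
			}
			if done {
				return nil
			}
		}
	}
}
```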

Solution

This PR updates the reprotect flow to confirm only that the initial sync job has begun, instead of waiting for the full sync to complete.
The function that derives the Link State now gives higher priority to SYNC_IN_PROGRESS, since that state is determined by the existence of active sync jobs for the named SyncIQ policy; the remaining states are derived from the presence or absence of a SyncIQ policy and its associated state, as sketched below.
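
A minimal sketch of the updated priority ordering; only SYNC_IN_PROGRESS is named in this PR, so the other state names and the function signature are assumptions for illustration:

```go
package replication

// deriveLinkState illustrates the updated priority: an active sync job
// for the named SyncIQ policy is checked first, so SYNC_IN_PROGRESS wins
// over states derived from the policy's presence or last-known state.
func deriveLinkState(activeJobs int, policyExists, targetWriteEnabled bool) string {
	switch {
	case activeJobs > 0:
		return "SYNC_IN_PROGRESS"
	case !policyExists:
		return "UNSYNCHRONIZED" // assumed name for the no-policy state
	case targetWriteEnabled:
		return "FAILEDOVER" // assumed name for a write-enabled local target
	default:
		return "SYNCHRONIZED"
	}
}
```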

Now, once the sync job is confirmed to be running, the Link State is updated to report SYNC_IN_PROGRESS for the initial sync. A minimal sketch of this check follows.
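
Again with hypothetical helper names, the new flow only needs to observe that a sync job exists before returning, so the five-minute context is sufficient even for large initial syncs:

```go
package replication

import (
	"context"
	"time"
)

// hasActiveSyncJob is a stand-in for the gopowerscale call that lists
// running sync jobs for a SyncIQ policy.
type hasActiveSyncJob func(ctx context.Context, policy string) (bool, error)

// confirmSyncStarted returns as soon as a sync job is observed running,
// instead of waiting for the full sync to finish.
func confirmSyncStarted(ctx context.Context, check hasActiveSyncJob, policy string) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		running, err := check(ctx, policy)
		if err != nil {
			return err
		}
		if running {
			return nil // Link State can now report SYNC_IN_PROGRESS
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```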

GitHub Issues

List the GitHub issues impacted by this PR:

| GitHub Issue # |
|---|
| dell/csm#1928 |

Checklist:

  • I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
  • I have verified that new and existing unit tests pass locally with my changes
  • I have not allowed coverage numbers to degenerate
  • I have maintained at least 90% code coverage
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • Backward compatibility is not broken

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Please also list any relevant details for your test configuration.

Replication e2e tests:

9 scenarios (9 passed)
48 steps (48 passed)
9m59.605695471s
--- PASS: TestReplication (599.62s)
    --- PASS: TestReplication/Synchronize_data_from_source_to_target_storage_system. (24.94s)
    --- PASS: TestReplication/Suspend_replication_between_two_storage_systems. (24.88s)
    --- PASS: TestReplication/Resume_a_suspended_replication_session_between_two_storage_systems. (64.31s)
    --- PASS: TestReplication/Execute_planned_failover. (65.21s)
    --- PASS: TestReplication/Reprotect_at_the_target_cluster. (115.29s)
    --- PASS: TestReplication/Execute_an_unplanned_failover. (64.98s)
    --- PASS: TestReplication/Execute_failback_from_a_failover_keeping_any_changes_made_on_the_target_cluster. (114.93s)
    --- PASS: TestReplication/Execute_planned_failover_in_preparation_for_the_next_failback_test. (60.19s)
    --- PASS: TestReplication/Execute_failback_from_a_failover_and_discard_any_changes_made_on_the_target_cluster. (64.86s)
PASS
status 0
ok      karavi-testing/karavi-replication/replication-test      664.728s

Kubernetes e2e - External Storage

$ ./e2e.test --ginkgo.timeout=2h --ginkgo.skip='\[Feature:|\[Disruptive\]' --ginkgo.focus='External.Storage.*' --storage.testdriver=./manifest.yaml
  I0702 15:37:33.218885 16383 external.go:189] Driver loaded from path [./manifest.yaml]: &{DriverInfo:{Name:csi-isilon.dellemc.com InTreePluginName: TestTags:[] MaxFileSize:0 SupportedSizeRange:{Max: Min:8Gi} SupportedFsType:map[:{} nfs:{}] SupportedMountOption:map[] RequiredMountOption:map[] Capabilities:map[block:false capacity:false controllerExpansion:true exec:false fsGroup:false multipods:true nodeExpansion:true offlineExpansion:true onlineExpansion:true persistence:true pvcDataSource:true snapshotDataSource:true topology:true volumeLimits:false] RequiredAccessModes:[] TopologyKeys:[csi-isilon.dellemc.com/10.247.102.193] NumAllowedTopologies:0 StressTestOptions:<nil> VolumeSnapshotStressTestOptions:0xc000aed3c0 PerformanceTestOptions:<nil>} StorageClass:{FromName:false FromFile: FromExistingClassName:isilon} VolumeAttributesClass:{FromName:false FromFile: FromExistingClassName:} SnapshotClass:{FromName:false FromFile: FromExistingClassName:isilon-snapclass} InlineVolumes:[] ClientNodeName: Timeouts:map[]}
  I0702 15:37:34.010988 16383 test_context.go:564] The --provider flag is not set. Continuing as if --provider=skeleton had been used.
  I0702 15:37:34.011126   16383 e2e.go:109] Starting e2e run "bf6ec656-08c8-46b4-9da9-3ee0f45726cb" on Ginkgo node 1
Running Suite: Kubernetes e2e suite
===========================================================================
Random Seed: 1751485053 - will randomize all specs

Will run 189 of 7037 specs
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSS...

Ran 37 of 7037 Specs in 1396.247 seconds
SUCCESS! -- 37 Passed | 0 Failed | 0 Pending | 7000 Skipped
PASS

lukeatdell and others added 9 commits June 26, 2025 16:56
- remove verbosity from unit tests. not needed. failures will be printed.
- remove clean from unit-test target. user can execute `make clean unit-test` if needed.
- add csm-common.mk to gitignore.
- reprotect will now verify the first sync job is running instead of waiting for the job to finish.
- link state status derivation function updated to give SYNC_IN_PROGRESS higher priority to more accurately represent the state.
falfaroc previously approved these changes Jul 2, 2025
santhoshatdell previously approved these changes Jul 2, 2025
- cover scenarios when LastJobState is either Running or Finished
@lukeatdell lukeatdell marked this pull request as draft July 2, 2025 20:09
@lukeatdell lukeatdell marked this pull request as ready for review July 2, 2025 20:40

github-actions bot commented Jul 2, 2025

Merging this branch will not change overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/dell/csi-powerscale/service | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/dell/csi-powerscale/service/replication.go | 0.00% (ø) | 0 | 0 | 0 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/dell/csi-powerscale/service/node_test.go
  • github.com/dell/csi-powerscale/service/replication_test.go
  • github.com/dell/csi-powerscale/service/service_test.go

@falfaroc falfaroc self-requested a review July 3, 2025 12:13
@lukeatdell lukeatdell merged commit 19df68e into main Jul 3, 2025
6 checks passed
@lukeatdell lukeatdell deleted the usr/lukeatdell/reprotect-bug branch July 3, 2025 12:46