K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy and startingDeadlineSeconds added.#1940
K8SPSMDB-1319 Don't run scheduled backup if DB is unhealthy and startingDeadlineSeconds added.#1940
Conversation
…godb-operator into K8SPSMDB-1319
| errStartingDeadlineExceeded = errors.New("starting deadline seconds exceeded") | ||
| ) | ||
|
|
||
| func (r *ReconcilePerconaServerMongoDBBackup) checkDeadlines(ctx context.Context, cluster *psmdbv1.PerconaServerMongoDB, cr *psmdbv1.PerconaServerMongoDBBackup) error { |
There was a problem hiding this comment.
The receiver on this function is not utilized, do we have a reason to keep it?
| err = cluster.CheckNSetDefaults(ctx, svr.Platform) | ||
| if err != nil { | ||
| return rr, errors.Wrapf(err, "set defaults for %s/%s", cluster.Namespace, cluster.Name) | ||
| err := errors.Wrapf(err, "wrong PSMDB options for %s/%s", cluster.Namespace, cluster.Name) |
There was a problem hiding this comment.
we perform some kind of validations inside the CheckNSetDefaults . Maybe we should adjust the message to invalid cr options used for...
…godb-operator into K8SPSMDB-1319
|
we have conflicts that need to be resolved before reviewing again |
| if cr.Status.State == psmdbv1.BackupStateError && status.State == "" { | ||
| return | ||
| } |
There was a problem hiding this comment.
During implementation, I encountered an issue:
when the starting deadline was exceeded, we correctly set the backup to an error state. However, this state was later overwritten in the defer block with an empty status. As a result, no error was logged and the backup did not stop as expected after the deadline was missed.
2025-05-21T09:55:52.894Z INFO Backup state changed {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup1","namespace":"pr-33-30cr"}, "namespace": "pr-33-30cr", "name": "backup1", "reconcileID": "42ed262d-ef9b-4df9-a9cd-5776d148250a", "previous": "error", "current": ""}
There was a problem hiding this comment.
So it looks this :
=== Wed May 28 11:16:20 CEST 2025 ===
NAME CLUSTER STORAGE DESTINATION TYPE SIZE STATUS COMPLETED AGE
backup1 my-cluster-name azure-blob error 19m
backup2 my-cluster-name azure-blob error 4m54s
backup3 my-cluster-name azure-blob 72s
----------------
=== Wed May 28 11:16:25 CEST 2025 ===
NAME CLUSTER STORAGE DESTINATION TYPE SIZE STATUS COMPLETED AGE
backup1 my-cluster-name azure-blob error 19m
backup2 my-cluster-name azure-blob error 5m
backup3 my-cluster-name azure-blob error 78s
2025-05-28T09:18:15.058Z INFO Backup didn't start in startingDeadlineSeconds, failing the backup {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup3","namespace":"pr"}, "namespace": "pr", "name": "backup3", "reconcileID": "8e5cd909-a7e8-496d-a3e9-c89088b09005", "startingDeadlineSeconds": 20, "passedSeconds": 187.058930347}
2025-05-28T09:18:15.115Z INFO Backup state changed {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup3","namespace":"pr"}, "namespace": "pr", "name": "backup3", "reconcileID": "8e5cd909-a7e8-496d-a3e9-c89088b09005", "previous": "error", "current": ""}
There was a problem hiding this comment.
we got correct error in psmdb-backup object but later change it to empty status.
status:
error: starting deadline seconds exceeded
state: error
| err := errors.Wrapf(err, "invalid cr oprions used for %s/%s", cluster.Namespace, cluster.Name) | ||
| if err := r.setFailedStatus(ctx, cr, err); err != nil { | ||
| return rr, errors.Wrap(err, "update status") | ||
| } | ||
| return reconcile.Result{}, err |
There was a problem hiding this comment.
| err := errors.Wrapf(err, "invalid cr oprions used for %s/%s", cluster.Namespace, cluster.Name) | |
| if err := r.setFailedStatus(ctx, cr, err); err != nil { | |
| return rr, errors.Wrap(err, "update status") | |
| } | |
| return reconcile.Result{}, err | |
| return reconcile.Result{}, errors.Wrapf(err, "invalid cr oprions used for %s/%s", cluster.Namespace, cluster.Name) |
| if err := checkStartingDeadline(ctx, cluster, cr); err != nil { | ||
| if err := r.setFailedStatus(ctx, cr, err); err != nil { | ||
| return rr, errors.Wrap(err, "update status") | ||
| } | ||
| return reconcile.Result{}, nil | ||
| } |
There was a problem hiding this comment.
By using the global err it's possible to set an error in the defer block without using the setFailedStatus method
| if err := checkStartingDeadline(ctx, cluster, cr); err != nil { | |
| if err := r.setFailedStatus(ctx, cr, err); err != nil { | |
| return rr, errors.Wrap(err, "update status") | |
| } | |
| return reconcile.Result{}, nil | |
| } | |
| err = checkStartingDeadline(ctx, cluster, cr) | |
| if err != nil { | |
| return reconcile.Result{}, err | |
| } |
There was a problem hiding this comment.
I believe it's better to return Error status immediately and don't wait defer. What do you think?
There was a problem hiding this comment.
Why? Setting the status in the defer or outside of it makes no difference in terms of time
In my opinion, we should follow a consistent approach in the code. We should either always set the failed status in the defer block, or set it right before the return statement, but we shouldn't mix both. By using inconsistent approaches makes the code harder to understand
I would stick to the current approach and not use setFailedStatus method at all. This would also eliminate the need for this check: #1940 (comment)
|
@nmarukovich I wonder how will this work with our current starting deadline logic: Deadline you added is for backups in New state. The logic above is for backups in Starting state. So they can be separate it's OK but error messages are too similar and it'll confuse people. |
Yes, Agree. I will update this message. I believe to avoid confusion we should use the same error message for the same logic in all operators. Why are wondering how it works? Do you have any other comments? |
There was a problem hiding this comment.
a comment to explain why this is needed would be nice
…godb-operator into K8SPSMDB-1319
commit: c1f2e14 |
CHANGE DESCRIPTION
Problem:
Short explanation of the problem.
Cause:
Short explanation of the root cause of the issue if applicable.
Solution:
and the feature startingDeadlineSeconds was added. It should work similar one that we have in PXC
percona/percona-xtradb-cluster-operator@3a4b8f6
We introduce a new field to PerconaServerMongoDBBackup: spec.startingDeadlineSeconds. If this field is set and if backup is not started in configured duration, operator will automatically fail the backup. This duration is checked by comparing the current time with backup job's creation timestamp.
In PXC it works like this:
When we start a cluster and initiate a backup while the cluster is still in the initialization state, the backup fails immediately after the startingDeadlineSeconds timeout is reached.
In PSMDB:
In main:
If we run backup before cluster object created, we failed immediately with the error
error: cluster not foundWhen we launch a cluster and trigger a backup during its initialization phase, the system patiently awaits the cluster's transition to a READY state before assigning a failed status to the backup operation.
Check Timestamps for cluster and backup.
2025-05-21T17:10:18.789Z ERROR failed to make backup {"controller": "psmdbbackup-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBBackup", "PerconaServerMongoDBBackup": {"name":"backup1","namespace":"main-1"}, "namespace": "main-1", "name": "backup1", "reconcileID": "4d40b3aa-726f-466b-91d6-fdc43219cbe2", "backup": "backup1", "error": "failed to runIn branch K8SPSMDB-1319:
When we start a cluster and trigger a backup during its initialization, it behaves like the main branch, failing the backup once the cluster is READY. This happens even with startingDeadlineSeconds set, as the backup fails if the timeout expires before the cluster is ready.
When we wait for the cluster to reach a READY state, then transition it back to the initialization state before triggering a backup, the startingDeadlineSeconds setting functions as expected.
CHECKLIST
Jira
Needs Doc) and QA (Needs QA)?Tests
compare/*-oc.yml)?Config/Logging/Testability