Describe the issue
When running Flyte in a multi-cluster environment (separate flyte-controlplane and flyte-dataplane clusters), FlyteAdmin can sometimes encounter errors while creating a FlyteWorkflow CRD in a dataplane cluster, e.g.:
failed to create workflow in propeller Post "https://kube-apiserver.<domain>/apis/flyte.lyft.com/v1alpha1/namespaces/ft-production/flyteworkflows?timeout=30s": net/http: TLS handshake timeout
This happens when the K8sWorkflowExecutor sends a POST to the kube-apiserver of the dataplane cluster:
flyte/flyteadmin/pkg/workflowengine/impl/k8s_executor.go (lines 78 to 84 at 3b621c1):

```go
_, err = targetCluster.FlyteClient.FlyteworkflowV1alpha1().FlyteWorkflows(data.Namespace).Create(ctx, flyteWf, v1.CreateOptions{})
if err != nil {
	if !k8_api_err.IsAlreadyExists(err) {
		logger.Debugf(context.TODO(), "Failed to create execution [%+v] in cluster: %s", data.ExecutionID, targetCluster.ID)
		return interfaces.ExecutionResponse{}, errors.NewFlyteAdminErrorf(codes.Internal, "failed to create workflow in propeller %v", err)
	}
}
```
As a result, the workflow is marked as failed on the very first attempt, with no good mechanism to retry the submission (other than recovering the execution manually).
In a multi-cluster setup, network issues are arguably inevitable, so a retry mechanism here would be appreciated. There is already a comprehensive retry handler for execution-node failures, but this error occurs before that point, and there is no retry machinery at this stage.
I am happy to implement myself and open a PR, just wanted to hear opinions on that matter. Thanks folks!
What if we do not do this?
Flyte multi-cluster setups would suffer badly from intermittent issues (especially network-related ones).
Related component(s)
flyteadmin
flytepropeller
Are you sure this issue hasn't been raised already?
- Yes
Have you read the Code of Conduct?
- Yes