Describe the issue
When running Flyte in a multi-cluster environment (separate flyte-controlplane and flyte-dataplane clusters), FlyteAdmin can sometimes encounter errors while creating a FlyteWorkflow CRD in a dataplane cluster, e.g.:
failed to create workflow in propeller Post "https://kube-apiserver.<domain>/apis/flyte.lyft.com/v1alpha1/namespaces/ft-production/flyteworkflows?timeout=30s": net/http: TLS handshake timeout
This happens when the K8sWorkflowExecutor sends a POST to the kube-apiserver of the dataplane cluster:
flyte/flyteadmin/pkg/workflowengine/impl/k8s_executor.go (lines 78 to 84 at 3b621c1):

```go
_, err = targetCluster.FlyteClient.FlyteworkflowV1alpha1().FlyteWorkflows(data.Namespace).Create(ctx, flyteWf, v1.CreateOptions{})
if err != nil {
	if !k8_api_err.IsAlreadyExists(err) {
		logger.Debugf(context.TODO(), "Failed to create execution [%+v] in cluster: %s", data.ExecutionID, targetCluster.ID)
		return interfaces.ExecutionResponse{}, errors.NewFlyteAdminErrorf(codes.Internal, "failed to create workflow in propeller %v", err)
	}
}
```
As a result, the workflow is marked as failed on the very first attempt, with no good mechanism to retry the submission (other than recovering the execution manually).
In a multi-cluster setup, network issues are arguably inevitable, so a retry mechanism here would be appreciated. There is already a comprehensive retry handler for execution-node failures, but this error occurs before that point, and there is no retry machinery at this stage.
I am happy to implement myself and open a PR, just wanted to hear opinions on that matter. Thanks folks!
What if we do not do this?
Flyte multi-cluster setups would suffer badly from intermittent issues (especially network-related ones).
Related component(s)
flyteadmin
flytepropeller
Are you sure this issue hasn't been raised already?
- Yes
Have you read the Code of Conduct?
- Yes